Protein sequence databases

Bioinformatics Practical Manual

Kailash Chandra Samal Amrita Priyadarsini Gyana Ranjan Rout Anath Bandhu Das Iswar Ch. Mohanty

Department of Agricultural Biotechnology College of Agriculture Orissa University of Agriculture & Technology Bhubaneswar 2014


First Edition: 2014

©Copyright 2014 Orissa University of Agriculture and Technology, Bhubaneswar, Odisha All rights reserved. No part of this publication may be reproduced, stored in retrieval system or transmitted by any means electronic, mechanical, photocopying, recording or otherwise, without prior written permission of publisher

Printed at: Bhubaneswari Traders, Nayapalli, Bhubaneswar

Preface

The idea of writing a Bioinformatics Practical Manual originated from our experience of teaching biotechnology and bioinformatics at Orissa University of Agriculture and Technology, Bhubaneswar, Odisha. The students needed a write-up that was comprehensive enough to cover all major aspects of the field, technical and sufficiently up to date to include the most current developments, and at the same time logical and easy to understand. Their interest motivated us to write this manual. It is written specifically for biotechnology and bioinformatics students and explains the basics of bioinformatics. All key areas of bioinformatics are covered, including biological databases, sequence alignment, gene and promoter prediction, molecular phylogenetics, structural bioinformatics, genomics and proteomics. The manual emphasizes the practical aspects of bioinformatics. It is hoped that this publication will help teachers, students and technicians to keep their practical knowledge of the various dimensions of bioinformatics up to date. We are grateful to Prof. Manoranjan Kar, Vice Chancellor, for his encouragement and valuable guidance in bringing out this publication. The constant encouragement and guidance of Prof. B.K. Mishra, Dean, College of Agriculture, in the preparation of this publication is duly acknowledged. The help of the ICAR in granting financial assistance for bringing out this publication is gratefully acknowledged.

Gyana Ranjan Rout Kailash Chandra Samal

Message-1

Visit us at: www.cabbsouat.org Tel & Fax : (0674) 2397375 (O), 2561458 (R) Fax: 0674-2397780

Email: [email protected] College of Agriculture ORISSA UNIVERSITY OF AGRICULTURE & TECHNOLOGY BHUBANESWAR 751003, ODISHA

Prof. B.K. Mishra Dated the 31st March, 2014

DEAN

Message

Economic growth and development in India continues to be propelled by growth in agriculture and allied sectors. This can only be sustained through technological advancement and competent human resources that serve the needs of farmers. Today, agricultural production through most conventional science and technology innovations has reached a plateau, and there is a need to break it. To put the country’s agricultural growth on a fast track, the development of cutting-edge technologies such as biotechnology and bioinformatics is the need of the hour. Biotechnology is based on techniques involving genes, genomes, nucleic acids and other related macro and micro biomolecules. Bioinformatics applies computer-based information technology to the storage, retrieval and analysis of the vast databases being generated on genes, genomes and nucleic acids. I am delighted that the Department of Agricultural Biotechnology, College of Agriculture, OUAT is going to publish the “Bioinformatics Practical Manual” for UG, PG and Ph.D. students of agriculture. This publication will strengthen the knowledge of students, researchers and faculty members on various techniques in the areas of bioinformatics. I am confident that this manual will be very helpful to them.

Content

Chapter / Exercise    Particulars
Chapter 1             Biological background
Chapter 2             Scope and application of bioinformatics
Chapter 3             Databases and its structure
Chapter 4             Biological database
Chapter 5             Database retrieval system
Chapter 6             Cataloging biological database
Chapter 7             Pairwise Sequence Alignment
Chapter 8             Multiple sequence alignment
Chapter 9             Practical exercises
Exercise 1.           Making search for the scientific literature and sequences
Exercise 2.           Characterization of a known Gene
Exercise 3.           Finding out open reading frames (ORF)
Exercise 4.           Translating an unknown DNA Sequence
Exercise 5.           Identifying a gene using BLAST program
Exercise 6.           Finding Domains in Protein Sequences
Exercise 7.           Nucleotide BLAST (BLASTn)
Exercise 8.           Protein BLAST (BLASTp)
Exercise 9.           Translated BLAST (BLASTx)
Exercise 10.          tBLASTx
Exercise 11.          Position-Specific Iterated BLAST (PSI-BLAST)
Exercise 12.          FASTA
Exercise 13.          Editing and analyzing multiple sequence alignment using Jalview
Exercise 14.          Making multiple alignment with T-Coffee
Exercise 15.          Online Mendelian Inheritance in Man (OMIM)
Exercise 16.          Protein Structure Database
Exercise 17.          Depositing sequences in database
Exercise 18.          Submitting sequences to GenBank through BankIt
Exercise 19.          Submitting sequences to GenBank through ‘Sequin’
Exercise 20.          Primer Designing


K. C. Samal et al.

Chapter 1
Biological background

Bioinformatics is a tool for providing insight into the structures and functions of biomolecules: DNA, RNA and proteins. In particular, bioinformatics deals with the task of understanding the information chemically encoded in living matter that controls the processes ongoing in all living organisms. It is usually concerned with applying statistical and computational methods to analyze biological data obtained from wet lab experiments, sequencing projects or simulations of protein-protein interactions, and with how this can help us understand the evolution of organisms and biological processes. It also provides insight into the Central Dogma of Molecular Biology, which characterizes the mappings between the different types of biopolymer (DNA, RNA and protein). Strictly speaking, the Central Dogma is a list of the usual transfers between different biomolecules within an organism. The theory classifies three kinds of `maps' between biopolymers as follows:

(i) General transfers:
    (a) DNA to DNA (DNA replication)
    (b) DNA to RNA (transcription)
    (c) RNA to protein (translation)
(ii) Special transfers:
    (a) RNA to RNA (RNA replication)
    (b) RNA to DNA (reverse transcription)
    (c) DNA to protein (direct translation)
(iii) Unknown transfers:
    (a) Protein to RNA
    (b) Protein to DNA
    (c) Protein to protein

General transfers are those that happen continuously in organisms, whereas special transfers happen rarely and often only in special situations. No unknown transfers have been recorded, although prions, which can alter the conformation of other proteins, may be considered by some to effect protein to protein transfers. The emphasis here is on the pathway from DNA to RNA to protein, as the study of this particular process yields the most practical applications in areas such as gene therapy. It is important to understand the nature and function of the biopolymers themselves and also the mechanisms connecting them, and that is the aim of this introduction.

Biopolymer: DNA

DNA (deoxyribonucleic acid) is a helical linear biopolymer. It is a helix-shaped molecule made up of two antiparallel strands of nucleotides. There are four types of nucleotides in DNA, corresponding to the letters A (adenine), T (thymine), C (cytosine) and G (guanine). DNA is usually represented by sequences of these four nucleotides. An A on one strand always pairs with a T on the opposite strand through two hydrogen bonds, while a C always pairs with a G through three hydrogen bonds, as these nitrogenous bases are complementary to each other. The two strands are therefore complementary to each other: one strand runs in the 5’ to 3’ direction while the other runs in the 3’ to 5’ direction. The sequential arrangement of the individual nucleotides is responsible for giving uniqueness to any individual living form, be it human, animal, plant or microbe.


This assumes that only one strand is considered; the second strand is always derivable from the first by pairing A’s with T’s and C’s with G’s, and vice versa. That derivation is called finding the reverse complement of a strand. DNA is the chemical basis of life and complexes with proteins to form the chromosomes. The double helix structure of DNA (B form) was discovered by James Watson and Francis Crick in 1953, for which they received the Nobel Prize in 1962. The DNA of a gene is split into disjoint intervals made from sequences of nucleotides. These intervals are categorized as either exons or introns: exons are the parts that are retained in the mature RNA, whereas introns (sometimes loosely grouped with so-called junk DNA) are transcribed but then removed from the RNA by splicing. DNA can be considered a mostly constant trait in a given multicellular organism, as it varies negligibly between cells under normal conditions.
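The derivation of the second strand from the first can be written in a few lines. The following is a minimal Python sketch; the function name `reverse_complement` and the example sequence are chosen here for illustration only.

```python
# Watson-Crick pairing: A pairs with T, C pairs with G
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand: str) -> str:
    """Return the opposite strand, written in its own 5' to 3' direction."""
    # Complement every base, then reverse because the strands are antiparallel
    return strand.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGCCATC"))  # GATGGCAT
```

Applying the function twice returns the original strand, which is a quick sanity check when working with real sequences.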


Biopolymer: RNA

Conversely, the transcribed RNA produced in an organism, though it is derived from DNA and is structurally similar (although usually single-stranded rather than a double helix), varies with several factors, including time and environmental factors such as intracellular chemical gradients. The theory of how these extraneous factors affect the derivation of RNA from DNA is of specific importance to bioinformatics projects. RNA preserves the information stored in DNA, as the nucleotides present in RNA `complement' the nucleotides of DNA, except that adenine is now complemented with uracil.

Biopolymer: Protein

Proteins are the active agents that govern the metabolic, structural and signaling processes at work in an individual organism. The translational map that creates protein from mRNA (messenger RNA, the specific type involved in RNA to protein translation) is mostly determined by the underlying mRNA; in fact, each `codon' (a sequence of three RNA nucleotides) corresponds exactly to an amino acid, the constituent building block of proteins, or to a start codon, a stop codon or an untranslated triplet.

Chromosomes and Genes

Each chromosome is a long piece of DNA. A human cell has 46 chromosomes (2 sets of 23, one set from each parent), and each set contains about 3.1 billion nucleotide pairs (bases). Genes are just regions on that DNA: contiguous stretches of one DNA strand that serve as templates for producing proteins. Genes can appear on either of the two DNA strands. The complete DNA content of an organism, including all of its genes, is called the genome of that organism. The function of much of the DNA between genes is largely unknown. Certain intergenic regions of DNA (called noncoding) are known to play a major role in cell regulation, the process that controls the production of proteins and their possible interactions with DNA. Proteins are produced from DNA using three operations or transformations called transcription, splicing and translation. DNA can also be replicated. The cell machinery that performs that task is called DNA polymerase. Biologists call the capability of DNA for replication and
undergoing the above three (or two) transformations the central dogma. Genes are transcribed into pre-RNA by a complex ensemble of molecules called RNA polymerase. During transcription the nucleotide T (thymine) is replaced by another one designated by the letter U (for uracil). Pre-RNA can be represented as alternating sequence segments called exons and introns. The exons are the parts of pre-RNA that will be expressed, that is, translated into proteins. Next comes the operation called splicing, performed by an ensemble of proteins and small RNAs called the spliceosome. Splicing consists of concatenating the exons and excising the introns to form what is known as mRNA, or simply RNA. The final phase, called translation, is essentially a “table look-up” performed by complex molecules called ribosomes (an ensemble of RNA and proteins). Translation repeatedly considers a triplet of consecutive nucleotides in RNA and produces one corresponding amino acid. The triplet is called a codon. In RNA, there is one special codon called the start codon and a few others called stop codons. An open reading frame (ORF) is a sequence of codons starting with a start codon and ending with a stop codon. The ORF is thus the sequence of nucleotides that is used by the ribosome to produce the sequence of amino acids that makes up a protein. There are 20 standard amino acids but, in certain rare situations, others can be added to that list. Since there are 64 different codons and 20 amino acids, the “table look-up” for translating each codon into an amino acid is redundant in the sense that multiple codons can produce the same amino acid. The “table” used by nature to perform translation is called the genetic code. Due to the redundancy of the genetic code, certain nucleotide changes in DNA may not alter the resulting protein. Once a protein is produced, it folds (most of the time) into a unique structure in 3D space.

In the 3D representation of a protein, one can distinguish three different types of components: α-helices, β-sheets and coils. The secondary structure of a protein is its sequence of amino acids, annotated to distinguish the boundaries of each component: helices, sheets and coils. The tertiary structure of a protein is its 3D representation. The function of a protein is the way it participates with other proteins and molecules in keeping the cell alive and interacting with its environment. Function is closely related to tertiary structure. In functional genomics, one studies the function of all
the proteins of a genome. One of the important goals of bioinformatics is to help biologists decipher the function of proteins.

Genes and Proteins

Most genes code for proteins: each gene contains the information necessary to make one protein. Proteins are the most important type of macromolecule, and they are of different types:
• Structural proteins: collagen in skin, keratin in hair, crystallin in the eye.
• Enzymes: all metabolic transformations (the building up, rearranging and breaking down of organic compounds) are carried out by enzymes, which are proteins.
• Transport proteins: oxygen in the blood is carried by hemoglobin, and almost everything that goes into or out of a cell (except water and a few gases) is carried by proteins.
• Others: nutrient proteins (egg yolk), hormones, defense, movement.

Open Reading Frames

Since codons consist of 3 bases, there are 3 “reading frames” possible on an RNA (or DNA) strand, depending on whether you start reading from the first base, the second base or the third base. The different reading frames give entirely different proteins. Consider ATGCCATC, and refer to the genetic code (X marks leftover bases that do not form a complete codon).
Reading frame 1 divides this into ATG-CCA-TC, which translates to Met-Pro-X.
Reading frame 2 divides this into A-TGC-CAT-C, which translates to X-Cys-His-X.
Reading frame 3 divides this into AT-GCC-ATC, which translates to X-Ala-Ile.
Each gene uses a single reading frame, so once the ribosome gets started, it just has to count off groups of 3 bases to produce the proper protein.

[Figure: another example of reading frames]
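The three reading frames of ATGCCATC can be reproduced with a short Python sketch. It builds the standard genetic code from a 64-letter string (bases ordered T, C, A, G) and uses one-letter amino acid codes, so Met-Pro appears as MP. The function name `translate_frame` is illustrative, and incomplete codons (the X positions above) are simply skipped.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_frame(dna: str, frame: int) -> str:
    """Translate every complete codon starting at offset `frame` (0, 1 or 2)."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

for frame in range(3):
    print("Reading frame", frame + 1, translate_frame("ATGCCATC", frame))
```

The three lines printed are MP, CH and AI, that is Met-Pro, Cys-His and Ala-Ile, matching the worked example.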


To find a gene, the first job is to find long ORFs, examining the longest ORFs first and putting together a set with minimal overlaps. It is also necessary to identify potential start codons, with the furthest upstream start codon as the easiest choice. Then, how do we know that the ORF contains a real gene? The most definitive way is to match it with a gene known from another species:
• Conservation of a sequence between species strongly suggests that the sequence has a function that is being conserved by natural selection.
• We compare protein sequences, not DNA, because protein is more conserved in evolution than DNA.
• The organism’s survival depends on the protein being functional, which means having the proper amino acid sequence.
• Since the genetic code is degenerate, many different DNA sequences will give identical proteins.
• The protein’s 3-dimensional structure is even more conserved, because it is more closely related to enzyme activity than the amino acid sequence is. However, we don’t have good ways of determining 3-D structure from a DNA sequence.
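The first step described above, listing ORFs from the furthest upstream start codon and looking at the longest ones first, can be sketched as follows. This is a simplified illustration and not a gene finder: it scans only the three frames of the given strand (the opposite strand would be handled by running it on the reverse complement), the function name `find_orfs` and the `min_codons` threshold are assumptions, and the test sequence is made up.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end, frame) tuples for ORFs on this strand, longest first.

    start and end are 0-based slice positions; the length counts both the
    start codon and the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # furthest-upstream ATG seen since the last stop codon
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOPS and start is not None:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame + 1))
                start = None
    return sorted(orfs, key=lambda orf: orf[1] - orf[0], reverse=True)

sequence = "CCATGAAATGGTTTTAGCATGCCCTGA"  # hypothetical example
for start, end, frame in find_orfs(sequence):
    print(frame, sequence[start:end])
```

Each candidate ORF would then be translated and compared with known proteins, as the list above explains.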


Genetic Code

Proteins are long chains of amino acids. There are 20 different amino acids coded in DNA. There are only 4 DNA bases, and two bases would give only 4 x 4 = 16 combinations, so 3 DNA bases are needed to code for the 20 amino acids: 4 x 4 x 4 = 64 possible 3-base combinations (codons). Each codon codes for one amino acid (or for a stop signal). Most amino acids have more than one possible codon. Genes start at a start codon and end at a stop codon. Three codons are stop codons, and all protein-coding genes end at a stop codon. Start codons are a bit trickier, since they are used in the middle of genes as well as at the beginning. In eukaryotes, ATG is almost always the start codon, making methionine (Met) the first amino acid in newly made proteins (but in many proteins it is immediately removed). In prokaryotes, ATG, GTG or TTG can be used as a start codon.

Gene Expression

How do you get a protein from a gene? Through a two-step process (called the Central Dogma of Molecular Biology). First, the gene has to be copied (transcribed) into an RNA form. The RNA copy (messenger RNA) is exactly like the gene itself, except that RNA replaces T with U. The RNA is then translated into protein by ribosomes, which are complex RNA/protein hybrid machines, with the help of transfer RNA molecules, which have one end that matches the 3-base codon and another end that is attached to the proper amino acid. The ribosome starts at the start codon and moves down the messenger RNA, adding one amino acid at a time to the growing chain. When the ribosome reaches a stop codon, it falls off, releasing the new protein.
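The counting argument for the genetic code can be checked directly. The Python sketch below enumerates all 4 x 4 x 4 codons, looks them up in the standard genetic code (written as a 64-letter string in T, C, A, G order, with `*` for stop) and counts how many codons each amino acid has. Variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codons = ["".join(c) for c in product(BASES, repeat=3)]
genetic_code = dict(zip(codons, AMINO))

stop_codons = sorted(c for c, aa in genetic_code.items() if aa == "*")
codons_per_amino_acid = Counter(aa for aa in genetic_code.values() if aa != "*")

print(len(codons))                 # 64
print(stop_codons)                 # ['TAA', 'TAG', 'TGA']
print(len(codons_per_amino_acid))  # 20
print(codons_per_amino_acid["L"], codons_per_amino_acid["M"])  # 6 1
```

Leucine (L) has six codons while methionine (M) has only ATG, which illustrates why most, but not all, amino acids have more than one codon.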



Transcription (Nucleus):
• In the nucleus, the twisted DNA molecule is unwound locally by the transcription machinery.
• One strand of the DNA is used as the template strand for RNA synthesis.
• RNA polymerase begins synthesizing RNA from the DNA template at the promoter sequence (a sequence that lets the RNA polymerase know where to begin).
• The RNA that is synthesized is called mRNA (messenger RNA); it leaves the nucleus and goes to the cytoplasm.

Translation (Cytoplasm):
• In the cytoplasm, the ribosome, which is built from rRNA (ribosomal RNA) and proteins and consists of a small and a large subunit, comes together to provide a site for translation to occur.
• tRNA (transfer RNA) is the RNA responsible for bringing the amino acid that should be added to the chain next.
• mRNA, rRNA and tRNA all come together to perform translation.



• Each mRNA codon specifies an amino acid, tRNA retrieves that amino acid, and rRNA provides a surface on which this occurs.
• When tRNA brings the first correct amino acid, a polypeptide chain is started.
• One amino acid is added at a time, and the amino acids are connected by peptide bonds.
• When the chain is finished, a protein is formed.
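The two steps listed above can be imitated in a few lines of Python. This is a sketch under stated assumptions: the template strand is written 3' to 5' so that the mRNA comes out 5' to 3' without reversing, splicing is ignored, the function names are illustrative, and the template sequence is made up.

```python
from itertools import product

RNA_BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(RNA_BASES, repeat=3), AMINO)}

def transcribe(template_3to5: str) -> str:
    """Pair each template base with its RNA partner (A->U, T->A, C->G, G->C)."""
    return template_3to5.upper().translate(str.maketrans("ACGT", "UGCA"))

def translate(mrna: str) -> str:
    """Start at the first AUG and add one amino acid per codon until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon: the ribosome releases the chain
            break
        protein.append(amino_acid)
    return "".join(protein)

template = "TACGGTAGAACT"  # hypothetical template strand, written 3' to 5'
mrna = transcribe(template)
print(mrna, translate(mrna))  # AUGCCAUCUUGA MPS
```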

Genetic marker

A DNA polymorphism that can be easily detected by molecular or biochemical analysis. The marker can be within a gene or in DNA with no known function. Because DNA segments that lie near each other on a chromosome tend to be inherited together, markers are often used as indirect ways of tracking the inheritance pattern of a gene that has not yet been identified, but whose approximate location is known.

Primer

A short, single-stranded oligonucleotide used to start DNA synthesis in a polymerase chain reaction (PCR). Primers are typically about 18 to 25 nucleotides long; shorter primers of about 10 nucleotides are used in RAPD analysis.

PCR

The development of the polymerase chain reaction (PCR) was a technological breakthrough by Kary Mullis in 1985, for which he received the Nobel Prize in 1993. The principle of PCR is very simple. It is based on the function of a copying enzyme, Taq DNA polymerase (obtained from the bacterium Thermus aquaticus, an inhabitant of hot springs), which is able to synthesize a duplicate molecule of DNA from a DNA template that is bracketed by the primers. The product of duplication of the original template DNA becomes a second template for another round of duplication. Repeated duplications thus lead to an exponential increase in DNA product accumulation. Even when starting from a single DNA molecule, detectable amounts of target DNA are generated by PCR in a few hours. DNA polymerase was first isolated from Thermus aquaticus in 1976. In 1989 Science
magazine named Taq polymerase as its first "Molecule of the Year". In 1993, Dr. Mullis was awarded the Nobel Prize for his work with PCR.

DNA fingerprinting

A technique used by scientists to distinguish between individuals of the same species using only samples of their DNA. It is a technique by which an individual can be identified at the molecular level. With the advancement of science and technology, VNTR (Variable Number of Tandem Repeats) and STR (Short Tandem Repeats) analysis has become very popular in forensic laboratories. The process of DNA fingerprinting was invented by Alec Jeffreys at the University of Leicester, England, in 1985; he was knighted in 1994. Scientists have chosen repeating sequences in the DNA which are present in all individuals on different chromosomes and are known to vary from individual to individual. These are used as genetic markers to identify the individual. The DNA fingerprinting technique has been successfully used for identification of plant species or cultivars, testing of seed purity, and detection of adulteration in food, seed and other planting material. The technique also helps to resolve disputes of maternity or paternity and is used in identification of cultivars or breeding material, wildlife forensics, and protection of farmers’ rights and biodiversity. This remarkable technology provides positive identification with virtually 100% precision. The DNA profile of an individual is unique. It can never be identical even in biologically related individuals, except for identical (monozygotic) twins. The chance of two people having exactly the same DNA profile is about 1 in 30,000 million (except for identical twins). Any biological material can be used as a source of DNA: leaf, seed or other plant parts in the case of plants, and a drop of blood, saliva, semen or any body part such as bone, tissue, skull, teeth or hair with root in the case of animals and human beings.
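Returning to the PCR paragraph above, the exponential accumulation of product can be made concrete with a small calculation. In this Python sketch each cycle multiplies the number of copies by (1 + efficiency); the `efficiency` parameter is an assumption added for illustration (1.0 means perfect doubling, which real reactions only approach), and the function name is hypothetical.

```python
def pcr_copies(initial_templates: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealized number of target copies after a given number of PCR cycles."""
    return initial_templates * (1 + efficiency) ** cycles

# Starting from a single template molecule
for cycles in (10, 20, 30):
    print(cycles, "cycles:", f"{pcr_copies(1, cycles):.3g}", "copies")
```

With perfect doubling, 30 cycles turn one molecule into 2^30, roughly a thousand million copies, which is why a single template can yield detectable DNA within a few hours.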
Molecular markers

A molecular marker in the life sciences is defined as a DNA sequence, a cytogenetic segment, a chromosome fragment, a protein or an enzyme used as
an experimental probe to keep track of an individual, a tissue, a cell, a nucleus, a chromosome or a gene. Different types of markers are used in the life sciences; the uses of molecular markers are given below.
• Assessment of genetic variability and fingerprinting of genotypes
• Mapping of monogenic traits and quantitative trait loci (QTL) for economically important traits
• Estimation of genetic distance or degree of relatedness between populations, inbreds and breeding materials, or among groups of accessions in germplasm
• Identification of sequences for candidate genes and economic breeding traits
• Marker-assisted selection for crop improvement in tissue-cultured plant species
• Genetic purity testing of seeds and micro-propagated plantlets
• Characterization and evaluation of plant genetic resources and their conservation
• Screening transgenic plants for resistance genes using linked molecular markers

The different molecular markers are:

(a) Restriction fragment length polymorphisms (RFLPs)

Restriction fragment length polymorphisms (RFLPs) are identified using restriction enzymes that cleave the DNA only at precise “restriction sites” (e.g. EcoRI cleaves at the site defined by the palindromic sequence GAATTC). At present, the most frequent use of RFLPs is downstream of PCR (PCR–RFLP), to detect alleles that differ in sequence at a given restriction site. A gene fragment is first amplified using PCR and then exposed to a specific restriction enzyme that cleaves only one of the allelic forms. The digested amplicons are generally resolved by electrophoresis.

Advantages
• RFLPs are co-dominant and can differentiate heterozygotes from homozygotes.
• It is a sensitive and highly reliable marker technique.
• It can identify a unique locus.

Disadvantages
• The technique is laborious and costly, and involves several time-consuming, tedious steps.
• The detection system uses radioisotopes or complex biochemistry.
• It requires a large amount of high-quality DNA.
• It requires species-specific primers or probes.
• It is not suitable for large-scale analysis of varieties or genomes.
• Automation is not possible.

(b) RAPD marker

RAPD (random amplification of polymorphic DNA) is a PCR-based method which employs short synthetic oligonucleotides (10 to 12 bases long) of random sequence as primers to amplify DNA fragments from genomic template DNA at low annealing temperatures. Amplification products are generally separated on agarose gels and stained with ethidium bromide. The amplified DNA fragments are scored visually and used for different analyses.

Advantages
• The RAPD technique is simple and cost effective.
• The procedure requires very small amounts of DNA and does not require cloning or prior knowledge of the genome sequence.
• The same primer can be used across the genome.
• It is suitable for large-scale analysis of genotypes.
• Automation is possible and no radioactivity is required.

Disadvantages
• RAPDs are commonly dominant markers, so heterozygotes cannot be differentiated from homozygotes.
• RAPD is less reliable.
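The PCR–RFLP idea described under (a) above, where only one allelic form is cut by the enzyme, can be illustrated with a short Python sketch. EcoRI cuts within GAATTC between the G and the first A; the two amplicon sequences are made up for illustration, and the function name `digest` is an assumption.

```python
ECORI_SITE = "GAATTC"  # EcoRI cuts between G and AATTC

def digest(amplicon: str, site: str = ECORI_SITE, cut_offset: int = 1):
    """Return the fragment lengths produced by cutting at every occurrence of `site`."""
    amplicon = amplicon.upper()
    cuts = []
    position = amplicon.find(site)
    while position != -1:
        cuts.append(position + cut_offset)
        position = amplicon.find(site, position + 1)
    bounds = [0] + cuts + [len(amplicon)]
    return [end - start for start, end in zip(bounds, bounds[1:])]

allele_a = "ACGTACGTGAATTCTTGACCGGA"  # hypothetical allele carrying the EcoRI site
allele_b = "ACGTACGTGAGTTCTTGACCGGA"  # hypothetical allele, one base changed in the site

print(digest(allele_a))  # [9, 14]
print(digest(allele_b))  # [23]
```

On a gel, the first allele would show two bands and the second a single uncut band; a heterozygote would show all three, which is what makes the marker co-dominant.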


(c) Microsatellites or SSR

Microsatellites, also called SSRs (Simple Sequence Repeats) or STRs (Short Tandem Repeats), consist of a stretch of DNA a few nucleotides long (2 to 6 base pairs, bp) repeated several times in tandem (e.g. CACACACACACACACA). They are spread over a eukaryote genome. Microsatellites are of relatively small size and can therefore be easily amplified using PCR from DNA extracted from a variety of sources, including blood, hair, skin or even faeces. Polymorphisms can be visualized on a sequencing gel, and the availability of automatic DNA sequencers allows high-throughput analysis of a large number of samples (Goldstein and Schlötterer, 1999; Jarne and Lagoda, 1996). Microsatellites are hypervariable; they often show tens of alleles at a locus that differ from each other in the number of repeats. They are still the markers of choice for diversity studies as well as for parentage analysis and Quantitative Trait Loci (QTL) mapping, although this might be challenged in the near future with the development of cheap methods for the assay of SNPs. FAO has published recommendations for sets of microsatellite loci to be used for diversity studies of major livestock species, which were developed by the ISAG–FAO Advisory Group on Animal Genetic Diversity (see DAD-IS library http://www.fao.org/dad-is/).

Advantages
• The technique is simple.
• Once markers are available, the assay requires little DNA and is fast and cost effective.
• Microsatellite markers are co-dominant.
• These markers are abundant, distributed evenly throughout the genome, and show a high level of polymorphism compared to other markers.
• They are especially useful for analyzing closely related genotypes.
• They are suitable for large-scale analysis of genotypes.

Disadvantages
• Species-specific primers are required.
• The markers have to be developed before the technique can be used.
• The initial cost of developing microsatellite markers is high.



(d) Amplified fragment length polymorphisms (AFLPs)

Amplified Fragment Length Polymorphism is a molecular marker generated by a combination of restriction digestion and PCR amplification.

Advantages
• AFLPs are highly polymorphic and evenly distributed throughout the plant genome, and hence serve as a useful tool for various genetic studies.
• The technique is suitable for large-scale analysis of genotypes.
• It can be used for DNA of any origin or complexity.
• It combines the advantages of both RFLP and RAPD.

Disadvantages
• AFLPs are mostly dominant in nature, and hence heterozygotes cannot be differentiated from homozygotes.
• High-quality DNA is required.
• The procedure is somewhat complex and requires careful handling.

(e) Sequence Tagged Site (STS)

STSs (Sequence Tagged Sites) are DNA sequences that occur only once in a genome, in a known position. They need not be polymorphic and are used to build physical maps.

(f) Single Nucleotide Polymorphism (SNP)

SNPs are variations at single nucleotides which do not change the overall length of the DNA sequence in the region. SNPs occur throughout the genome and are highly abundant. Most SNPs are located in non-coding regions and have no direct impact on the phenotype of an individual. However, some introduce mutations in expressed sequences or in regions influencing gene expression (promoters, enhancers), and may induce changes in protein structure or regulation. These SNPs have the potential to detect functional genetic variation.
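Because SNPs leave the sequence length unchanged, two aligned sequences of equal length can be compared position by position. The Python sketch below does exactly that; the function name `find_snps` and both sequences are made up for illustration, and real data would first need to be aligned.

```python
def find_snps(reference: str, sample: str):
    """Return (position, reference_base, sample_base) for each difference (1-based)."""
    if len(reference) != len(sample):
        raise ValueError("sequences must have equal length (SNPs do not change length)")
    pairs = zip(reference.upper(), sample.upper())
    return [(i, r, s) for i, (r, s) in enumerate(pairs, start=1) if r != s]

reference = "ATGGCCATTGTA"  # hypothetical reference allele
sample = "ATGGCTATTGTA"     # hypothetical sample allele

print(find_snps(reference, sample))  # [(6, 'C', 'T')]
```

In this example the change turns the codon GCC into GCT, and both code for alanine, so this particular SNP would not alter the protein.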


Chapter 2
Scope and application of bioinformatics

Bioinformatics is the field of science in which biology, computer science and information technology merge to form a single discipline. It covers the collection, organization, analysis, presentation and sharing of biological data to solve biological problems at the molecular level. It is an interdisciplinary scientific field that develops methods for storing, retrieving, organizing and analyzing biological data. A major activity in bioinformatics is the development of software tools that generate useful biological knowledge. Bioinformatics draws on many areas of computer science, statistics, mathematics and engineering to process biological data. Databases and information systems are used to store and organize biological data. Analyzing biological data may involve algorithms from artificial intelligence, soft computing, data mining, image processing and simulation. The algorithms in turn depend on theoretical foundations such as discrete mathematics, control theory, system theory, information theory and statistics. Commonly used software tools and technologies in the field include Java, C#, XML, Perl, C, C++, Python, R, SQL, CUDA, MATLAB and spreadsheet applications.

The term bioinformatics was coined by Paulien Hogeweg and Ben Hesper in the 1970s for the study of information processes in biotic systems. The National Center for Biotechnology Information (NCBI, 2001) defines bioinformatics as follows: “Bioinformatics is the field of science in which biology, computer science and information technology merge to form a single discipline. There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics with which to assess relationships among members of large data sets; the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information”.

Bioinformatics is a scientific discipline that has emerged in response to the accelerating demand for a flexible and intelligent means of storing, managing and querying large and complex biological data sets. The ultimate aim of
bioinformatics is to enable the discovery of new biological insight as well as to create a global perspective form which unifying principles in biology can be discerned. Over the past decade rapid developments in bioinformatics technologies have combined to produce a tremendous amount of information related to molecular biology. At the beginning of genomic revolution, the main concern of bioinformatics was the creation and maintenance of a database to store biological information such as nucleotide and amino acid sequence. Development of this type of database involved not only to design issue but development of the interface where by researchers could both access existing data as well as submit or revised data (e.g to the NCBI, http;//www.ncbinlm nih.gov/). More recently, emphasis has shifted towards the analysis of large data sets, particularly those stored in different formats in different databases. Ultimately, all of this information must be combined to form a comprehensive picture of normal cellular activities so that researcher may study how these activities are altered in different disease states. Therefore, the field of bioinformatics has evolved such that most pressing task, now introduced to analysis and interpretation of various types of data, including nucleotide and amino acid sequence, protein domain and protein structure. Origin and history of Bioinformatics Gregore Mendel, ‘father of Genetics’, illustrated that the inheritance of traits is controlled by factor passed down from generation to generation. After this discovery of Mendel, bioinformatics and genetic record keeping have come long way. The understanding of genetics have advance remarkably in the last thirty years. In 1972, Paul Berg made the first recombinant DNA molecules using ligase. In that same year, Stanley Cohen, Annie Chang, Robert Helling and Herbert Boyer showed that extra-chromosomal bits of DNA called plasmids act as vectors for maintaining cloned genes in bacteria. 
This discovery was a major breakthrough for genetic engineering, allowing for such advances as gene cloning and the modification of genes. In 1973, two important things happened in the field of genomics. Joseph Sambrook led a team that refined DNA electrophoresis using agarose gel, and Herbert Boyer and Stanley Cohen introduced DNA cloning. In 1977, a method for sequencing DNA was discovered and the first genetic
engineering company, Genentech, was founded. By 1981, 579 human genes had been mapped, and mapping by in situ hybridization had become a standard method. In 1988, the Human Genome Organization (HUGO) was founded. This is an international organization of scientists involved in the Human Genome Project, which was started in 1990. By 1991, a total of 1879 human genes had been mapped. In 1993, Genethon, a human genome research center in France, produced a physical map of the human genome. In 1995, the first complete genome sequence of a free-living organism, the bacterium Haemophilus influenzae, was published. In 1996, Genethon published the final version of the human genetic map, which included data from patients, preclinical and clinical trials and metabolic pathways of numerous species.

Challenges: The greatest challenge facing the molecular biology community today is to make sense of the wealth of data that has been produced by the genome sequencing projects. Cells have a central core called the nucleus, which is the storehouse of the cell's DNA, known as the genome. Genes are specific regions of the genome (about 1%) spread through the genome, sometimes contiguous, many times non-contiguous. RNA similarly contains information; its major purpose is to copy information from DNA selectively and to bring it out of the nucleus for use. Proteins are made of amino acids, of which there are twenty. The central dogma of molecular biology is that genes, regions of the DNA in the nucleus of the cell, are copied into RNA, and the RNA travels to protein production sites where it is translated into protein.

Difference between bioinformatics and computational biology

Both bioinformatics and computational biology combine computers and biology. Biologists who specialize in the use of computational tools and systems to answer problems of biology are bioinformaticians.
Computer scientists, mathematicians, statisticians and engineers who specialize in developing theories, algorithms and techniques for such tools and systems are computational biologists. The actual process of
analyzing and interpreting data is referred to as computational biology. Important sub-disciplines within bioinformatics and computational biology include: Bioinformatics has become a mainstay of genomics, proteomics and all other omics (such as phenomics), and many information technology companies have entered the business, which creates a convergence of IT and BT. A bioinformaticist is an expert who not only knows how to use bioinformatics tools but also knows how to write interfaces for effective use of the tools. A bioinformatician, on the other hand, is a trained individual who only knows how to use bioinformatics tools, without a deeper understanding.

Application of bioinformatics

Bioinformatics, an emerging area offering a fundamental tool to the scientific community, aims to speed up the research, application and commercialization of biotechnology. It is the marriage between biotechnology and information technology that has led to the growth and development of this field. The genomic revolution has strengthened the central role of bioinformatics in understanding the very basis of life processes. Over the last decade, biologists have handled a number of genome research projects that include DNA sequencing, proteomics, expression studies and metabolomics. The completely sequenced genomes include human (Homo sapiens), mouse (Mus musculus), insect (Drosophila melanogaster), plant (Arabidopsis thaliana), yeast (Saccharomyces cerevisiae), bacteria (Escherichia coli, Vibrio cholerae) and worm (Caenorhabditis elegans), and their complete sequence information has been stored in various public databases. Results from genomic studies have now become immensely important in biological and medical research. Therefore, a vast amount of biological information needs to be stored, organized and indexed so that the information can be retrieved and used. The term bioinformatics emphasizes the multidisciplinary nature of the field and also conveys the nature of its applications.
Bioinformatics is becoming increasingly important due to the interest of the pharmaceutical industry in genome sequencing projects. There is a vital need to harness this information for medical diagnostic and therapeutic uses, and there are opportunities for other
industrial applications. This field is evolving rapidly, which makes it challenging for biotechnology professionals to keep up with recent advancements. The area has evolved to deal with four distinct problems, viz.:
(a) Handling and management of biological data, including its organization, control, linkages, analysis and so forth.
(b) Communication among people, projects and institutions engaged in biological research and applications.
(c) Organization, access, search and retrieval of biological information, documents and literature.
(d) Analysis and interpretation of biological data through computational approaches.

Bioinformatics has wide application in the following areas:

1. Storage and organization of data: Bioinformatics is used to organize biological data to help researchers access information, add new information arising out of experiments and modify existing information. For example, the Protein Data Bank (PDB) (http://www.rcsb.org/pdb/) is the single worldwide repository for the processing and distribution of 3-D biological macromolecular structure data.

2. Information Search and Retrieval: Information search and retrieval is one of the most powerful applications of bioinformatics. For example, PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) is a service of the National Library of Medicine. It includes links to many sites providing full-text articles and other related resources.

3. Sequence Comparison: One of the most useful and popular applications for biologists is sequence comparison or sequence alignment. BLAST and FASTA are two online tools which can perform pairwise comparison of sequences.
Multiple sequence alignment methods assemble pairwise sequence alignments for many related sequences into a picture of sequence homology among all members of a gene family. Multiple sequence alignments aid in the visual identification of sites on a DNA or protein sequence that may be functionally important. Those sites are usually conserved.

4. Linkage Analysis: Genealogical research and linkage analysis involve the analysis of a large amount of data. The chromosomal location of genes, which has important implications in disease identification, can be identified using linkage analysis. Various online tools (http://linkage.rockefeller.edu/) are used for linkage analysis.

5. Comparative Genomics: Comparative genomics rests on the assumption that the similarity of two sequences, whether DNA, RNA or protein, implies functional correlation. One of the most successful bioinformatics applications is sequence alignment against large databases of known sequences using online BLAST tools.

6. Functional Genomics: To investigate genes in their cellular context, expression analysis is carried out using microarrays and DNA chips. The comparison of expression patterns of well-defined metabolic states allows the identification of pathological phenotypes at the molecular level.

7. Proteomics: Proteomics refers to the identification and analysis of all proteins of a cell (the proteome). It involves the determination of protein interactions and biological pathways. The publication of entire genome sequences led to a shift of interest from DNA sequencing to protein localization and characterization within the cellular context.

8. Structural Genomics: Structural genomics covers the calculation of three-dimensional structures based on the sequence of a macromolecule. The theoretical basis of the relationship
between sequence and structure is the most fundamental problem of in silico biology. Only knowledge of the structure of a protein can provide a deeper understanding of its function.

9. Pharmacogenomics: The development of drugs aims to maximize effect and minimize side effects. The genetic variation among all humans is only 0.1% of the total DNA. This variation takes the form of point mutations, which can have a phenotypic impact. These so-called SNPs (single nucleotide polymorphisms) are good candidates for drug development and diagnosis.

10. Cellomics or Systems Biology: If sufficient data are available and all relevant components for life are identified, more complex interactions can be investigated. For a holistic biological understanding of the cell, simulations of cells, entire organisms and populations provide new insights. The simulation of life in silico is a future direction for bioinformatics that has already begun.

11. Phylogenetic Analysis: Phylogenetic analysis attempts to describe the evolutionary relationships of a group of sequences. The information in a molecular sequence alignment can be used to compute a phylogenetic tree for a particular family of gene sequences. The branching in a phylogenetic tree represents evolutionary relationships inferred from sequence similarity. Phylogenetic analysis of protein sequence families also tells us about the evolution of entire organisms.

12. Primer Design: Many molecular biology protocols require the design of oligonucleotide primers. Proper primer design is critical for the success of polymerase chain reaction (PCR), oligo-hybridization, DNA sequencing and microarray experiments. Primers must hybridize with the target DNA and, in addition, must have the following qualities: appropriate physico-chemical properties; they must not self-hybridize or dimerize; and they should not have multiple targets within the sequence under investigation.
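The primer qualities listed above can be checked with a few lines of code. The following Python sketch is illustrative only: the function names are hypothetical, the melting temperature uses the simple Wallace rule (2 °C for each A or T and 4 °C for each G or C), which is only a rough estimate for short oligonucleotides, and the self-dimer test looks only for a perfect match to the last four bases at the 3′ end. Dedicated primer design programs use far more detailed thermodynamic models.

```python
def gc_content(primer: str) -> float:
    """Fraction of G and C bases in the primer (0.0 to 1.0)."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Rough melting temperature in deg C by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence written 5' to 3'."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def has_3prime_self_dimer(primer: str, window: int = 4) -> bool:
    """True if the last `window` bases could pair perfectly with part of the primer itself.

    Simplifying assumption: only exact complementarity of the 3' tail is tested.
    """
    primer = primer.upper()
    tail = primer[-window:]
    return reverse_complement(tail) in primer


def count_target_sites(primer: str, template: str) -> int:
    """Number of exact (possibly overlapping) occurrences of the primer in the template."""
    primer, template = primer.upper(), template.upper()
    last_start = len(template) - len(primer) + 1
    return sum(template.startswith(primer, i) for i in range(last_start))
```

A primer would normally be accepted only if its GC content and melting temperature fall within the range chosen for the experiment, if `has_3prime_self_dimer` returns False, and if `count_target_sites` finds exactly one site in the sequence under investigation.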



13. Constructing Evolutionary (Phylogenetic) Trees: Biodiversity databases are used to collect species names, descriptions, distributions, genetic information, status and size of populations, habitat needs, and how each organism interacts with other species. Computer simulation models are useful to study population dynamics or to calculate the cumulative genetic health of a breeding pool (in agriculture) or an endangered population (in conservation). Entire DNA sequences or genomes of endangered species can be preserved, allowing the results of nature's experiments to be remembered in silico. There are two areas in biology where enormous amounts of information are generated. One is molecular biology, which deals with base sequences in DNA and amino acid sequences in proteins, and the other is the biodiversity information crisis. Mathematics and computers are being used to tackle these problems with procedures which come under the label of bioinformatics.

Evolutionary trees are often constructed after comparing sequences belonging to different organisms. Trees group the sequences according to their degree of similarity. They serve as a guide to reasoning about how these sequences have been transformed through evolution. For example, they help to infer homology from similarity, and may rule out erroneous assumptions that contradict known evolutionary processes.

14. Detecting Patterns in Sequences: There are certain parts of DNA and amino acid sequences that need to be detected. Two prime examples are the search for genes in DNA and the determination of subcomponents of a sequence of amino acids (secondary structure). There are several ways to perform these tasks. Many of them are based on machine learning and include probabilistic grammars or neural networks.

15. Determining 3-D Structures from Sequences: The problems in bioinformatics that relate sequences to 3-D structures are computationally difficult. The determination of RNA shape from sequence requires algorithms of cubic complexity. The inference of the shapes of proteins from amino acid sequences remains an unsolved problem.
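The cubic cost mentioned above for RNA can be illustrated with the classical Nussinov base-pair maximization scheme, shown here as a minimal Python sketch. It is a simplification and not a full structure predictor: it only counts the maximum number of nested base pairs rather than minimizing free energy, and the set of allowed pairs (Watson-Crick plus G-U wobble) and the minimum hairpin loop length of three bases are assumptions made for this example. The three nested loops over i, j and k are the source of the cubic running time.

```python
def nussinov_max_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs in an RNA sequence (Nussinov recursion).

    Assumptions for this sketch: A-U, G-C and G-U pairs are allowed, and at
    least `min_loop` unpaired bases must separate the two bases of a pair.
    """
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    allowed = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

    # best[i][j] holds the maximum number of pairs within rna[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]

    for span in range(min_loop + 1, n):          # distance between i and j
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]               # case 1: base j stays unpaired
            for k in range(i, j - min_loop):     # case 2: base j pairs with base k
                if (rna[k], rna[j]) in allowed:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]
```

For a sequence of length n the table has about n²/2 entries and each entry scans up to n split points, which gives the cubic complexity.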



16. Inferring Cell Regulation: The function of a gene or protein is best described by its role in a metabolic or signaling pathway. Genes interact with each other; proteins can also prevent or assist in the production of other proteins. The available approximate models of cell regulation can be either discrete or continuous. One usually distinguishes between cell simulation and modeling. The latter amounts to inferring the former from experimental data (say, microarrays). This process is usually called reverse engineering.

17. Determining Protein Function and Metabolic Pathways: This is one of the most challenging areas of bioinformatics, and one for which considerable data are not readily available. The objective here is to interpret human annotations for protein function and also to develop databases representing graphs that can be queried for the existence of nodes (specifying reactions) and paths (specifying sequences of reactions).

18. Assembling DNA Fragments: Fragments provided by sequencing machines are assembled using computers. The tricky part of the assembly is that DNA has many repetitive regions and the same fragment may belong to different regions. The algorithms for assembling DNA are mostly used by large companies (like the former Celera).

Using Script Languages: Many of the above applications are already available on websites. Their usage requires scripting that provides data for an application, receives the results back, and then analyzes them.

The algorithms required to perform the above tasks are detailed in the following subsections. What differentiates bioinformatics problems from others is the huge size of the data and its (sometimes questionable) quality. That explains the need for approximate solutions. It should be remarked that several of the problems in bioinformatics are constrained optimization problems. The solution to those problems is usually computationally expensive.
One of the most efficient known methods in optimization is dynamic programming. That explains why this technique is often used in bioinformatics. Other approaches like branch-and-bound are also used, but they are known to have higher complexity.
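As an illustration of dynamic programming, the following Python sketch computes the optimal global alignment score of two sequences in the style of the Needleman-Wunsch algorithm. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for the example, not values prescribed in this manual. Each cell of the table is filled from three neighbouring cells that are already solved, so the optimum is found without enumerating every possible alignment.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of sequences a and b by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against an empty sequence costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap       # gap inserted in b
            left = score[i][j - 1] + gap     # gap inserted in a
            score[i][j] = max(diagonal, up, left)

    return score[-1][-1]
```

The table has (len(a) + 1) × (len(b) + 1) cells and each is computed in constant time, so the work grows with the product of the two sequence lengths.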



19. Drug designing: Bioinformatics has applications in knowledge-based drug design. Computational studies of protein–ligand interactions provide a rational basis for the rapid identification of novel leads for synthetic drugs. Knowledge of the three-dimensional structures of proteins allows molecules to be designed that are capable of binding to the receptor site of a target protein with great affinity and specificity. This informatics-based approach significantly reduces the time and cost necessary to develop drugs with higher potency, fewer side effects and less toxicity than the traditional trial-and-error approach. In the last two decades, tens of thousands of three-dimensional protein structures have been determined by X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR). One central question for biological scientists is whether it is practical to predict possible protein–protein interactions based only on these 3D shapes, without doing protein–protein interaction experiments. A variety of methods have been developed to tackle the protein–protein docking problem, though it seems that there is still much work to be done in this field.

We are interested in information about our DNA, our proteins and the function of proteins. Genes and proteins can be sequenced, so the sequence of bases in genes or of amino acids in proteins can be determined. This information must be stored in an intelligent fashion, so that scientists can solve problems quickly and easily using all available information. Therefore, the information is stored in databanks, many of which are accessible to everyone on the internet. A few examples are a databank containing protein structures (the PDB, or Protein Data Bank), a databank containing protein sequences and their function (Swiss-Prot), a databank with information about enzymes and their function (ENZYME), and a databank with nucleotide sequences of all genes sequenced to date (EMBL).

20. Human health care and Forensic science: In forensics, results from molecular phylogenetic analysis have been accepted as evidence in criminal courts. Some sophisticated Bayesian statistics and likelihood-based methods for the analysis of DNA have been applied in the analysis of forensic identity. It is worth mentioning that genomics and bioinformatics are now
poised to revolutionize our healthcare system by developing personalized and customized medicine. High-speed genomic sequencing coupled with sophisticated informatics technology will allow a doctor in a clinic to quickly sequence a patient’s genome, easily detect potentially harmful mutations, and engage in early diagnosis and effective treatment of diseases.

21. Agriculture: Bioinformatics tools are being used in agriculture as well. Plant genome databases and gene expression profile analyses have played an important role in the development of new crop varieties that have higher productivity and more resistance to disease.

Bioinformatics in India

According to a recent study, India will be a potential star in the field of bioscience in the coming years, considering factors like biodiversity, human resources, infrastructure facilities and government initiatives. Bioinformatics has emerged from inputs from several different areas such as biology, biochemistry, biophysics, molecular biology, biostatistics and computer science. Specially designed algorithms and organized databases are the core of all informatics operations. The requirements for such an activity make heavy and high-level demands on both hardware and software capabilities. This sector is the fastest growing field in the country. The vertical growth is because of the linkage between IT and biotechnology, spurred by the Human Genome Project. Promising start-ups are already present in Bangalore, Hyderabad, Pune, Chennai and Delhi. There are over 200 companies functioning in these places. IT majors such as Intel, IBM and Wipro are getting into this segment, spurred by the promise of technological developments.

Limitations

Having recognized the power of bioinformatics, it is also important to realize its limitations and avoid over-reliance on and over-expectation of bioinformatics output. In fact, bioinformatics has a number of inherent limitations. In many ways, the role of bioinformatics in genomics and molecular biology
research can be likened to the role of intelligence gathering in battlefields. Intelligence is clearly very important in leading to victory in a battlefield. Fighting a battle without intelligence is inefficient and dangerous. Having superior information and correct intelligence helps to identify the enemy’s weaknesses and reveal the enemy’s strategy and intentions. The gathered information can then be used in directing the forces to engage the enemy and win the battle. However, completely relying on intelligence can also be dangerous if the intelligence is of limited accuracy. Over-reliance on poor-quality intelligence can yield costly mistakes, if not complete failures. It is no stretch of the analogy to say that fighting diseases or other biological problems using bioinformatics is like fighting battles with intelligence. Bioinformatics and experimental biology are independent, but complementary, activities. Bioinformatics depends on experimental science to produce raw data for analysis. It, in turn, provides useful interpretation of experimental data and important leads for further experimental research. Bioinformatics predictions are not formal proofs of any concepts. They do not replace the traditional experimental research methods of actually testing hypotheses. In addition, the quality of bioinformatics predictions depends on the quality of the data and the sophistication of the algorithms being used. Sequence data from high-throughput analysis often contain errors. If the sequences are wrong or the annotations incorrect, the results from the downstream analysis are misleading as well. That is why it is so important to maintain a realistic perspective of the role of bioinformatics.


Chapter 3

Databases and their structure

One of the hallmarks of modern genomic research is the generation of enormous amounts of raw sequence data. As the volume of genomic data grows, sophisticated computational methodologies are required to manage the data deluge. Thus, the very first challenge in the genomics era is to store and handle the staggering volume of information through the establishment and use of computer databases. The development of databases to handle the vast amount of molecular biological data is thus a fundamental task of bioinformatics. This chapter introduces some basic concepts related to the development and management of databases. Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization (both cellular and chromosomal), clinical effects of mutations as well as similarities of biological sequences and structures.

What is a database?

A database is a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria. Databases are composed of computer hardware and software for data management. The chief objective of the development of a database is to organize data in a set of structured records to enable easy retrieval of information. Each record, also called an entry, should contain a number of fields that hold the actual data items, for example, fields for names, phone numbers, addresses, dates. To retrieve a particular record from the database, a user can specify a particular piece of information, called a value, to be found in a particular field and expect the computer to retrieve the whole data record. This process is called making a query. Although data retrieval is the main purpose of all databases, biological databases often have a higher level of requirement, known as knowledge discovery, which refers to the identification of connections between pieces of information that were not known
when the information was first entered. For example, databases containing raw sequence information can perform extra computational tasks to identify sequence homology or conserved motifs. These features facilitate the discovery of new biological insights from raw data.

Organization of databases:

Databases can be constructed either as flat files, relational, or object oriented. Flat files are simple text files and lack any form of organization to facilitate information retrieval by computers. Relational databases organize data as tables and search information among tables with shared features. Object-oriented databases organize data as objects and associate the objects according to hierarchical relationships.

(a) Flat file database: Originally, databases all used a flat file format, which is a long text file that contains many entries separated by a delimiter, a special character such as a vertical bar (|). Within each entry are a number of fields separated by tabs or commas. Except for the raw values in each field, the entire text file does not contain any hidden instructions for computers to search for specific information or to create reports based on certain fields from each record. The text file can be considered a single table. Thus, to search a flat file for a particular piece of information, a computer has to read through the entire file, an obviously inefficient process. This is manageable for a small database, but as database size increases or data types become more complex, this database style can become very difficult for information retrieval. Indeed, searches through such files often cause crashes of the entire computer system because of the memory-intensive nature of the operation.

To facilitate the access and retrieval of data, sophisticated computer software programs for organizing, searching, and accessing data have been developed. They are called database management systems. These systems contain not only raw data records but also operational instructions to help identify hidden connections among data records. The purpose of establishing a data structure is for easy execution of the searches and to combine different records to form final search reports.
Depending on the types of data structures, these database management systems can
be classified into two types: relational database management systems and object-oriented database management systems. Consequently, databases employing these management systems are known as relational databases or object-oriented databases, respectively.

Name, State, Course#, Course name |
Dhawale Rahmi, Maharashtra, PPT-301, Plant pathology |
Gite Vikram Balaji, Bihar, ABT-510, Bioinformatics |
Nihar Ranjan, Odisha, ST-512, Statistics |
Hinge Shyam A, Rajasthan, PBG-612, Plant breeding |
Thirat Suital Bansi, Maharashtra, ABT-517, Microbiology |
Surve Ratnapal, Jharkhand, PP-621, Plant physiology |
Kalapad santosh, Kerala, ENT-614, Entomology |
Kumara Swamy, Tamil Nadu, AC-524, Soil chemistry |

Figure: Example of constructing a flat file database for eight students’ course information
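The inefficiency of searching a flat file can be seen in a short Python sketch. It assumes, as described in the text, that records are separated by a vertical bar and fields by commas; the variable and function names are hypothetical. To answer even a simple question, every record in the file has to be read and split.

```python
# The eight student records from the figure, written as one flat-file string.
FLAT_FILE = (
    "Dhawale Rahmi, Maharashtra, PPT-301, Plant pathology|"
    "Gite Vikram Balaji, Bihar, ABT-510, Bioinformatics|"
    "Nihar Ranjan, Odisha, ST-512, Statistics|"
    "Hinge Shyam A, Rajasthan, PBG-612, Plant breeding|"
    "Thirat Suital Bansi, Maharashtra, ABT-517, Microbiology|"
    "Surve Ratnapal, Jharkhand, PP-621, Plant physiology|"
    "Kalapad santosh, Kerala, ENT-614, Entomology|"
    "Kumara Swamy, Tamil Nadu, AC-524, Soil chemistry|"
)


def parse_records(text: str) -> list:
    """Split the flat file into records (on '|') and fields (on ',')."""
    records = []
    for entry in text.split("|"):
        entry = entry.strip()
        if not entry:                 # skip the empty piece after the last bar
            continue
        name, state, course_no, course_name = [f.strip() for f in entry.split(",")]
        records.append({"name": name, "state": state,
                        "course_no": course_no, "course_name": course_name})
    return records


def courses_for_state(text: str, state: str) -> list:
    """Linear scan: every record is examined, whatever the size of the file."""
    return [r["course_name"] for r in parse_records(text) if r["state"] == state]
```

Calling `courses_for_state(FLAT_FILE, "Maharashtra")` walks through all eight records to find the two that match; with millions of records the same scan would have to be repeated in full for every query.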

(b) Relational Databases

Instead of using a single table as in a flat file database, relational databases use a set of tables to organize data. Each table, also called a relation, is made up of columns and rows. Columns represent individual fields. Rows represent values in the fields of records. The columns in a table are indexed according to a common feature called an attribute, so they can be cross-referenced in other tables. To execute a query in a relational database, the system selects linked data items from different tables and combines the information into one report. Therefore, specific information can be found more quickly from a relational database than from a flat file database. Relational databases can be created using a special programming language called structured query language (SQL). The creation of this type of database can take a great deal of planning during the design phase. After creation of the original database, a new data category can be easily added without requiring all existing tables to be modified. The subsequent database searching and data gathering for reports are relatively straightforward. Here is a simple example of student course information expressed in a flat file which contains records of eight students from seven different states, each taking a different course. Each data record, separated by a vertical bar, contains four fields describing the name, state, course
number and title. A relational database is also created to store the same information, in which the data are structured as a number of tables. In each table, data that fit a particular criterion are grouped together. Different tables can be linked by common data categories, which facilitates the finding of specific information.

Relational database

Table A
Student No#   Name                  State
1             Dhawale Rahmi         Maharashtra
2             Gite Vikram Balaji    Bihar
3             Nihar Ranjan          Odisha
4             Hinge Shyam A         Rajasthan
5             Thirat Suital Bansi   Maharashtra
6             Surve Ratnapal        Jharkhand
7             Kalapad santosh       Kerala
8             Kumara Swamy          Tamil Nadu

Table B
Student No#   Course#
1             PPT-301
2             ABT-510
3             ST-512
4             PBG-612
5             ABT-517
6             PP-621
7             ENT-614
8             AC-524

Table C
Course#       Course name
PPT-301       Plant pathology
ABT-510       Bioinformatics
ST-512        Statistics
PBG-612       Plant breeding
ABT-517       Microbiology
PP-621        Plant physiology
ENT-614       Entomology
AC-524        Soil chemistry

Figure: Example of constructing a relational database for eight students’ course information originally expressed in a flat file. By creating three different tables linked by common fields, data can be easily accessed and reassembled.
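The three linked tables can be sketched with SQL using Python's built-in sqlite3 module. This is a minimal illustration, not the way any particular biological database is implemented: the table and column names (table_a, table_b, table_c, student_no, course_no and so on) are assumptions chosen to mirror the figure. The query joins the tables on their shared fields to find which courses the students from a given state are taking.

```python
import sqlite3

STUDENTS = [                                   # Table A: student number, name, state
    (1, "Dhawale Rahmi", "Maharashtra"), (2, "Gite Vikram Balaji", "Bihar"),
    (3, "Nihar Ranjan", "Odisha"), (4, "Hinge Shyam A", "Rajasthan"),
    (5, "Thirat Suital Bansi", "Maharashtra"), (6, "Surve Ratnapal", "Jharkhand"),
    (7, "Kalapad santosh", "Kerala"), (8, "Kumara Swamy", "Tamil Nadu"),
]
ENROLMENTS = [                                 # Table B: student number, course number
    (1, "PPT-301"), (2, "ABT-510"), (3, "ST-512"), (4, "PBG-612"),
    (5, "ABT-517"), (6, "PP-621"), (7, "ENT-614"), (8, "AC-524"),
]
COURSES = [                                    # Table C: course number, course name
    ("PPT-301", "Plant pathology"), ("ABT-510", "Bioinformatics"),
    ("ST-512", "Statistics"), ("PBG-612", "Plant breeding"),
    ("ABT-517", "Microbiology"), ("PP-621", "Plant physiology"),
    ("ENT-614", "Entomology"), ("AC-524", "Soil chemistry"),
]


def build_database() -> sqlite3.Connection:
    """Create the three tables in an in-memory SQLite database and fill them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE table_a (student_no INTEGER PRIMARY KEY, name TEXT, state TEXT);
        CREATE TABLE table_c (course_no TEXT PRIMARY KEY, course_name TEXT);
        CREATE TABLE table_b (
            student_no INTEGER REFERENCES table_a(student_no),
            course_no  TEXT    REFERENCES table_c(course_no)
        );
        """
    )
    conn.executemany("INSERT INTO table_a VALUES (?, ?, ?)", STUDENTS)
    conn.executemany("INSERT INTO table_c VALUES (?, ?)", COURSES)
    conn.executemany("INSERT INTO table_b VALUES (?, ?)", ENROLMENTS)
    return conn


def courses_for_state(conn: sqlite3.Connection, state: str) -> list:
    """Join Table A -> Table B -> Table C to list course names for one state."""
    rows = conn.execute(
        """
        SELECT c.course_name
        FROM table_a AS a
        JOIN table_b AS b ON b.student_no = a.student_no
        JOIN table_c AS c ON c.course_no = b.course_no
        WHERE a.state = ?
        ORDER BY a.student_no
        """,
        (state,),
    ).fetchall()
    return [row[0] for row in rows]
```

With this layout a new data category, for example a table of course instructors keyed by course number, could be added without altering the three existing tables.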

For example, suppose one asks the question: which courses are students from the state ‘Maharashtra’ taking? The database will first find the field for “State” in Table A and look up ‘Maharashtra’. This returns students 1 and 5. The student numbers are co-listed in Table B, in which students 1 and 5 correspond to PPT-301 and ABT-517, respectively. The course names listed by course numbers are found in Table C. By going to Table C, the exact course names corresponding to the course numbers can be retrieved. A final report is then given showing that the students of ‘Maharashtra’ are taking the courses ‘Plant pathology’ and ‘Microbiology’. However, executing the same query through the flat file requires the computer to read through the entire text file word by word, to store the information in a temporary memory space, and later to mark up the data records containing the word ‘Maharashtra’. This is easily accomplished for a small database. To perform
queries in a large database using flat files obviously becomes an enormous task for the computer system.

(c) Object-Oriented Databases

One of the problems with relational databases is that the tables used do not describe complex hierarchical relationships between data items. To overcome the problem, object-oriented databases have been developed that store data as objects. In an object-oriented programming language, an object can be considered as a unit that combines data and mathematical routines that act on the data. The database is structured such that the objects are linked by a set of pointers defining predetermined relationships between the objects. Searching the database involves navigating through the objects with the aid of the pointers linking different objects. Programming languages like C++ are used to create object-oriented databases. The object-oriented database system is more flexible; data can be structured based on hierarchical relationships. By doing so, programming tasks can be simplified for data that are known to have complex relationships, such as protein structure data. However, this type of database system lacks the rigorous mathematical foundation of the relational databases. There is also a risk that some of the relationships between objects may be misrepresented. Some current databases have therefore incorporated features of both types of database programming, creating the object–relational database management system.

The above students’ course information can be used to construct an object-oriented database. Three different objects can be designed: student object, course object, and state object. The objects are linked by pointers, shown as lines with arrows. Finding specific information relies on navigating through the objects by way of pointers. For simplicity, some of the pointers are omitted. To answer the same question (which courses are students from ‘Maharashtra’ taking?), one simply needs to start from ‘Maharashtra’ in the state object, which has pointers that lead to students 1 and 5 in the student object. Further pointers in the student object point to the course each of the two students is taking. Therefore, a simple navigation through the linked objects provides a final report.
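The navigation by pointers described above can be imitated in Python, where an attribute that holds another object plays the role of a pointer. This is only a sketch under stated assumptions: the class and function names are hypothetical, plain in-memory objects stand in for a real object-oriented database management system, and only the state-to-student and student-to-course pointers are modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Course:
    number: str
    name: str


@dataclass
class Student:
    number: int
    name: str
    course: Course                      # pointer from student object to course object


@dataclass
class State:
    name: str
    students: List[Student] = field(default_factory=list)   # pointers to student objects


RECORDS = [
    (1, "Dhawale Rahmi", "Maharashtra", "PPT-301", "Plant pathology"),
    (2, "Gite Vikram Balaji", "Bihar", "ABT-510", "Bioinformatics"),
    (3, "Nihar Ranjan", "Odisha", "ST-512", "Statistics"),
    (4, "Hinge Shyam A", "Rajasthan", "PBG-612", "Plant breeding"),
    (5, "Thirat Suital Bansi", "Maharashtra", "ABT-517", "Microbiology"),
    (6, "Surve Ratnapal", "Jharkhand", "PP-621", "Plant physiology"),
    (7, "Kalapad santosh", "Kerala", "ENT-614", "Entomology"),
    (8, "Kumara Swamy", "Tamil Nadu", "AC-524", "Soil chemistry"),
]


def build_states(records) -> Dict[str, State]:
    """Create the linked state, student and course objects from the records."""
    states: Dict[str, State] = {}
    for number, name, state_name, course_no, course_name in records:
        student = Student(number, name, Course(course_no, course_name))
        states.setdefault(state_name, State(state_name)).students.append(student)
    return states


def courses_for_state(states: Dict[str, State], state_name: str) -> List[str]:
    """Start at the state object and follow pointers: state -> students -> course."""
    state = states.get(state_name)
    if state is None:
        return []
    return [student.course.name for student in state.students]
```

No table scan or join is needed here: the answer is reached by following the links stored in the objects, which is the behaviour the text describes.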



Chapter 4

Biological Databases

Based on their content, biological databases are divided into primary, secondary, and specialized databases. Primary databases simply archive sequence or structure information; secondary databases include further analysis on the sequences or structures. Specialized databases cater to a particular research interest. Current biological databases use all three types of database structures: flat files, relational, and object oriented. Despite the obvious drawbacks of using flat files in database management, many biological databases still use this format. The justification for this is that this system involves a minimum amount of database design and the search output can be easily understood by working biologists.

(I) Primary Databases

There are three major public sequence databases that store raw nucleic acid sequence data produced and submitted by researchers worldwide: GenBank, the European Molecular Biology Laboratory (EMBL) database and the DNA Data Bank of Japan (DDBJ), which are all freely available on the Internet. Most of the data in the databases are contributed directly by authors with a minimal level of annotation. A small number of sequences, especially those published in the 1980s, were entered manually from published literature by database management staff. Presently, sequence submission to GenBank, EMBL, or DDBJ is a precondition for publication in most scientific journals, to ensure that the fundamental molecular data are made freely available. These three public databases closely collaborate and exchange new data daily. They together constitute the International Nucleotide Sequence Database Collaboration. This means that by connecting to any one of the three databases, one should have access to the same nucleotide sequence data. Although the three databases all contain the same sets of raw data, each of the individual databases has a slightly different kind of format to represent the data.
Fortunately, for the three-dimensional structures of biological macromolecules, there is only one centralized database, the PDB. This database archives atomic coordinates of macromolecules (both proteins and nucleic acids) determined by X-ray crystallography and NMR. It uses a flat file format to represent protein name, authors, experimental details, secondary structure, cofactors, and atomic coordinates. The web interface of PDB also provides viewing tools for simple image manipulation.

(a) GenBank

GenBank is the most complete collection of annotated nucleic acid sequence data for almost every organism. The content includes genomic DNA, mRNA, cDNA, ESTs, high-throughput raw sequence data, and sequence polymorphisms. There are two ways to search for sequences in GenBank. One is to use text-based keywords, similar to a PubMed search. The other is to use molecular sequences to search by sequence similarity using BLAST.

GenBank Sequence Format

Searching GenBank effectively with the text-based method requires an understanding of the GenBank sequence format. GenBank is a relational database. However, the search output for sequence files is produced as flat files for easy reading. The resulting flat files contain three sections: Header, Features, and Sequence entry. There are many fields in the Header and Features sections. Each field has a unique identifier for easy indexing by computer software. Understanding the structure of GenBank files helps in designing effective search strategies.

Header section: The “Header section” describes the origin of the sequence, the identification of the organism, and the unique identifiers associated with the record. The top line of the Header section is the LOCUS line, which contains a unique database identifier for the location of the sequence in the database (not a chromosome locus). The identifier is followed by the sequence length and the molecule type (e.g., DNA or RNA). This is followed by a three-letter code for the GenBank division. There are 17 divisions in total, which were set up simply for convenience of data storage without necessarily having a rigorous scientific basis; for example, PLN for plant, fungal, and algal sequences; PRI for primate sequences; MAM for nonprimate mammalian sequences; BCT for bacterial sequences; and EST for EST sequences. Next to the division is the date when the record was made public (which is different from the date when the data were submitted).

The following line, “DEFINITION,” provides summary information for the sequence record, including the name of the sequence, the name and taxonomy of the source organism if known, and whether the sequence is complete or partial. This is followed by the accession number for the sequence, a unique number that is assigned to a piece of DNA when it is first submitted to GenBank and is permanently associated with that sequence. This is the number that should be cited in publications. It has two different formats: one letter followed by five digits (for example, M69051) or two letters followed by six digits. For a nucleotide sequence that has been translated into a protein sequence, the protein is given a new accession number in the form of a string of alphanumeric characters.

In addition to the accession number, there is also a version number and a GenInfo identifier (gi) number. The purpose of these numbers is to identify the current version of the sequence. If the sequence is revised at a later date, the accession number remains the same, but the version number is incremented and a new gi number is assigned. A translated protein sequence also has a different gi number from the DNA sequence it is derived from.

The next line in the Header section is the “ORGANISM” field, which gives the source organism with the scientific name of the species and sometimes the tissue type. Along with the scientific name comes the taxonomic classification of the organism. Different levels of the classification are hyperlinked to the NCBI taxonomy database, which holds more detailed descriptions. This is followed by the “REFERENCE” field, which provides the publication citation related to the sequence entry. The REFERENCE part includes author and title information for the published work (or a tentative title for unpublished work). The “JOURNAL” field includes the citation information as well as the date of sequence submission. The citation is often hyperlinked to the PubMed record for access to the original literature information. The last part of the Header is the contact information of the sequence submitter.
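As an illustration of how the Header fields lend themselves to indexing by software, the following minimal Python sketch splits a LOCUS line and a VERSION line into named fields. It assumes the whitespace-separated layout of the glucokinase record used in this chapter (including a topology word such as "linear"); older or unusual records may differ. The function names are hypothetical, and the accession pattern (one letter plus five digits, or two letters plus six digits) covers only the classic nucleotide accession layouts.

```python
import re

# Classic nucleotide accession layouts: 1 letter + 5 digits, or 2 letters + 6 digits.
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def parse_locus(line):
    """Split a LOCUS line into named fields (whitespace-separated layout assumed)."""
    parts = line.split()
    # Expected order: LOCUS, name, length, 'bp', molecule type, topology, division, date
    return {
        "name": parts[1],
        "length": int(parts[2]),
        "molecule": parts[4],
        "topology": parts[5],
        "division": parts[6],
        "date": parts[7],
    }


def parse_version(line):
    """Split a VERSION line into accession, version number, and gi number."""
    _, accession_version, gi_field = line.split()
    accession, version = accession_version.split(".")
    return {
        "accession": accession,
        "version": int(version),
        "gi": int(gi_field.split(":")[1]),
    }


locus = parse_locus("LOCUS       HUMGKNASE    2550 bp    mRNA    linear   PRI 29-SEP-1995")
version = parse_version("VERSION     M69051.1  GI:183226")

print(locus["name"], locus["length"], locus["division"])
print(version["accession"], version["version"], version["gi"])
print("accession looks valid:", bool(ACCESSION_RE.match(version["accession"])))
```

The same splitting approach extends to the other Header fields, because each one starts with its own keyword at the beginning of the line.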



Human liver glucokinase (ATP:D-hexose 6-phosphotransferase) mRNA, complete cds
GenBank: M69051.1

LOCUS       HUMGKNASE    2550 bp    mRNA    linear   PRI 29-SEP-1995
DEFINITION  Human liver glucokinase (ATP:D-hexose 6-phosphotransferase) mRNA,
            complete cds.
ACCESSION   M69051
VERSION     M69051.1  GI:183226
KEYWORDS    ATP:D-hexose 6-phosphotransferase; glucokinase.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 2550)
  AUTHORS   Tanizawa,Y., Koranyi,L.I., Welling,C.M. and Permutt,M.A.
  TITLE     Human liver glucokinase gene: cloning and sequence determination
            of two alternatively spliced cDNAs
  JOURNAL   Proc. Natl. Acad. Sci. U.S.A. 88 (16), 7294-7297 (1991)
   PUBMED   1871135
COMMENT     Original source text: Homo sapiens male adult liver cDNA to mRNA.
FEATURES             Location/Qualifiers
     source          1..2550
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
     gene            1..2550
                     /gene="GLK"
     misc_feature    156..1680
                     /gene="GLK"
     exon            204..327
                     /gene="GLK"
                     /product="glucokinase"
                     /note="ATP:D-hexose 6-phosphotransferase; cassette exon"
                     /EC_number="2.7.1.1"
     CDS             286..1680
                     /gene="GLK"
                     /EC_number="2.7.1.1"
                     /product="glucokinase"
                     /protein_id="AAB59563.1"
                     /db_xref="GI:183227"
                     /translation="MPRPRSQLPQPNSQVEQILAEFQLQEEDLKKVMRRMQKEMDRGL
                     RLETHEEASVKMLPTYVRSTPEGSEVGDFLSLDLGGTNFRVMLVKVGEGEEGQWSVKT
                     ……………….
                     RESRSEDVMRITVGVDGSVYKLHPSFKERFHASVRRLTPSCEITFIESEEGSGRGAAL
                     VSAVACKKACMLGQ"
     variation       602
                     /gene="GLK"
                     /note="'t' is common in the population; 'c' or 't'"
                     /replace="t"
ORIGIN
        1 aagccctggg ctgccagcct caggcagctc tccatccaag cagccgttgc tgccacaggc
       61 gggccttacg ctccaaggct acagcatgtg ctaggcctca gcaggcagga gcatctctgc
      121 ctcccaaagc atctacctct tagcccctcg gagagatggc gatggatgtc acaaggagcc
…………………………………..
     2461 ccccatcata tgacatgcca ccctctccat gcccaaccta agattgtgtg ggttttttaa
     2521 ttaaaaatgt taaaagtttt aaaaaaaaaa
//

Figure: GenBank database record
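The CDS line of the record above gives the coding region as 286..1680. A short Python sketch shows how such a simple "start..end" location can be turned into numbers and checked. It handles only this plain form, not the more complex location expressions that feature tables can contain, and the function names are hypothetical.

```python
import re


def parse_simple_location(location):
    """Parse a plain 'start..end' location (1-based, inclusive) into two integers."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match is None:
        raise ValueError(f"unsupported location: {location!r}")
    return int(match.group(1)), int(match.group(2))


def cds_summary(location):
    """Report the length of a coding region and whether it is a whole number of codons."""
    start, end = parse_simple_location(location)
    length = end - start + 1            # both end points are part of the feature
    codons, remainder = divmod(length, 3)
    return {"start": start, "end": end, "length": length,
            "codons": codons, "in_frame": remainder == 0}


def extract(sequence, start, end):
    """Return the 1-based, inclusive span start..end of a sequence string."""
    return sequence[start - 1:end]


print(cds_summary("286..1680"))
print(extract("aagccctggg", 2, 4))
```

For this record the region is 1395 bases long, that is 465 codons; if the last codon is a stop codon, as is usual for a complete CDS, this corresponds to a protein of 464 amino acids.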


Features section: The “Features” section includes annotation information about the gene and gene product, as well as regions of biological significance reported in the sequence, with identifiers and qualifiers. The “source” field provides the length of the sequence, the scientific name of the organism, and the taxonomy identification number. Optional information includes the clone source, the tissue type, and the cell line. The “gene” field gives information about the nucleotide coding sequence and its name. For DNA entries, there is a “CDS” field, which gives the boundaries of the sequence that can be translated into amino acids. For eukaryotic DNA, this field also contains the locations of exons, and the translated protein sequence is included.

The third section of the flat file is the sequence itself, starting with the label “ORIGIN.” The format of the sequence display can be changed by choosing options from the Display pull-down menu at the upper left corner. For DNA entries, there is a BASE COUNT report that gives the numbers of A, G, C, and T in the sequence. This section, for both DNA and protein sequences, ends with two forward slashes (the “//” symbol).

In retrieving DNA or protein sequences from GenBank, the search can be limited to different fields of annotation such as “organism,” “accession number,” “authors,” and “publication date.” Alternatively, a number of search qualifiers can be used, each defining one of the fields in a GenBank file. The qualifiers are similar to, but not the same as, the field tags in PubMed. For example, in GenBank, [GENE] represents the field for gene name, [AUTH] for author name, and [ORGN] for organism name.

Alternative Sequence Formats

FASTA: In addition to the GenBank format, there are many other sequence formats. FASTA is one of the simplest and most popular sequence formats because it contains plain sequence information that is readable by many bioinformatics analysis programs. It has a single definition line that begins with a right angle bracket (>) followed by a sequence name. Sometimes, extra information such as the gi number or comments can be given, separated from the sequence name by a “|” symbol. The extra information is considered optional and is ignored by sequence analysis programs. The plain sequence, in standard one-letter symbols, starts on the second line. Each line of sequence data is limited to sixty to eighty characters in width. The drawback of this format is that much annotation information is lost.

Human liver glucokinase (ATP:D-hexose 6-phosphotransferase) mRNA, complete cds
GenBank: M69051.1

>gi|183226|gb|M69051.1|HUMGKNASE Human liver glucokinase (ATP:D-hexose 6-phosphotransferase) mRNA, complete cds
AAGCCCTGGGCTGCCAGCCTCAGGCAGCTCTCCATCCAAGCAGCCGTTGCTGCCACAGGCGGGCCTTACG
CTCCAAGGCTACAGCATGTGCTAGGCCTCAGCAGGCAGGAGCATCTCTGCCTCCCAAAGCATCTACCTCT
TAGCCCCTCGGAGAGATGGCGATGGATGTCACAAGGAGCCAGGCCCAGACAGCCTTGACTCTGCCAGACT
CTCCTCTGAACTCGGGCCTCACATGGCCAACTGCTACTTGGAACAAATCGCCCCTTGGCTGGCAGATGTG
TTAACATGCCCAGACCAAGATCCCAACTCCCACAACCCAACTCCCAGGTAGAGCAGATCCTGGCAGAGTT
CCAGCTGCAGGAGGAGGACCTGAAGAAGGTGATGAGACGGATGCAGAAGGAGATGGACCGCGGCCTGAGG
CTGGAGACCCATGAAGAGGCCAGTGTGAAGATGCTGCCCACCTACGTGCGCTCCACCCCAGAAGGCTCAG
AAGTCGGGGACTTCCTCTCCCTGGACCTGGGTGGCACTAACTTCAGGGTGATGCTGGTGAAGGTGGGAGA
AGGTGAGGAGGGGCAGTGGAGCGTGAAGACCAAACACCAGACGTACTCCATCCCCGAGGACGCCATGACC
GGCACTGCTGAGATGCTCTTCGACTACATCTCTGAGTGCATCTCCGACTTCCTGGACAAGCATCAGATGA
AACACAAGAAGCTGCCCCTGGGCTTCACCTTCTCCTTTCCTGTGAGGCACGAAGACATCGATAAGGGCAT
CCTTCTCAACTGGACCAAGGGCTTCAAGGCCTCAGGAGCAGAAGGGAACAATGTCGTGGGGCTTCTGCGA
GACGCTATCAAACGGAGAGGGGACTTTGAAATGGATGTGGTGGCAATGGTGAATGACACGGTGGCCACGA
TGATCTCCTGCTACTACGAAGACCATCAGTGCGAGGTCGGCATGATCGTGGGCACGGGCTGCAATGCCTG
CTACATGGAGGAGATGCAGAATGTGGAGCTGGTGGAGGGGGACGAGGGCCGCATGTGCGTCAATACCGAG
TGGGGCGCCTTCGGGGACTCCGGCGAGCTGGACGAGTTCCTGCTGGAGTATGACCGCCTGGTGGACGAGA
GCTCTGCAAACCCCGGTCAGCAGCTGTATGAGAAGCTCATAGGTGGCAAGTACATGGGCGAGCTGGTGCG
GCTTGTGCTGCTCAGGCTCGTGGACGAAAACCTGCTCTTCCACGGGGAGGCCTCCGAGCAGCTGCGCACA
CGCGGAGCCTTCGAGACGCGCTTCGTGTCGCAGGTGGAGAGCGACACGGGCGACCGCAAGCAGATCTACA
ACATCCTGAGCACGCTGGGGCTGCGACCCTCGACCACCGACTGCGACATCGTGCGCCGCGCCTGCGAGAG
CGTGTCTACGCGCGCTGCGCACATGTGCTCGGCGGGGCTGGCGGGCGTCATCAACCGCATGCGCGAGAGC
CGCAGCGAGGACGTAATGCGCATCACTGTGGGCGTGGATGGCTCCGTGTACAAGCTGCACCCCAGCTTCA
AGGAGCGGTTCCATGCCAGCGTGCGCAGGCTGACGCCCAGCTGCGAGATCACCTTCATCGAGTCGGAGGA
GGGCAGTGGCCGGGGCGCGGCCCTGGTCTCGGCGGTGGCCTGTAAGAAGGCCTGTATGCTGGGCCAGTGA
GAGCAGTGGCCGCAAGCGCAGGGAGGATGCCACAGCCCCACAGCACCCAGGCTCCATGGGGAAGTGCTCC
CCACACGTGCTCGCAGCCTGGCGGGGCAGGAGGCCTGGCCTTGTCAGGACCCAGGCCGCCTGCCATACCG
CTGGGGAACAGAGCGGGCCTCTTCCCTCAGTTTTTCGGTGGGACAGCCCCAGGGCCCTAACGGGGGTGCG
GCAGGAGCAGGAACAGAGACTCTGGAAGCCCCCCACCTTTCTCGCTGGAATCAATTTCCCAGAAGGGAGT
TGCTCACTCAGGACTTTGATGCATTTCCACACTGTCAGAGCTGTTGGCCTCGCCTGGGCCCAGGCTCTGG
GAAGGGGTGCCCTCTGGATCCTGCTGTGGCCTCACTTCCCTGGGAACTCATCCTGTGTGGGGAGGCAGCT
CCAACAGCTTGACCAGACCTAGACCTGGGCCAAAAGGGCAGGCCAGGGGCTGCTCATCACCCAGTCCTGG
CCATTTTCTTGCCTGAGGCTCAAGAGGCCCAGGGAGCAATGGGAGGGGGCTCCATGGAGGAGGTGTCCCA
AGCTTTGAATACCCCCCAGAGACCTTTTCTCTCCCATACCATCACTGAGTGGCTTGTGATTCTGGGATGG
ACCCTCGCAGCAGGTGCAAGAGACAGAGCCCCCAAGCCTCTGCCCCAAGGGGCCCACAAAGGGGAGAAGG
GCCAGCCCTACATCTTCAGCTCCCATAGCGCTGGCTCAGGAAGAAACCCCAAGCAGCATTCAGCACACCC
CAAGGGACAACCCCATCATATGACATGCCACCCTCTCCATGCCCAACCTAAGATTGTGTGGGTTTTTTAA
TTAAAAATGTTAAAAGTTTTAAAAAAAAAA

Figure: DNA sequence in FASTA format
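Because the FASTA layout is so simple, it can be read with a few lines of code. The sketch below is a minimal reader written for illustration, not the parser of any particular analysis program: it collects each definition line with its sequence, splits an NCBI-style definition line on the “|” symbol, and counts the bases. The function names are hypothetical, and the example uses only the first two sequence lines of the record shown above.

```python
from collections import Counter

# Definition line and the first two sequence lines of the record in the figure (truncated).
EXAMPLE = """\
>gi|183226|gb|M69051.1|HUMGKNASE Human liver glucokinase (ATP:D-hexose 6-phosphotransferase) mRNA, complete cds
AAGCCCTGGGCTGCCAGCCTCAGGCAGCTCTCCATCCAAGCAGCCGTTGCTGCCACAGGCGGGCCTTACG
CTCCAAGGCTACAGCATGTGCTAGGCCTCAGCAGGCAGGAGCATCTCTGCCTCCCAAAGCATCTACCTCT
"""


def read_fasta(text):
    """Return a list of (definition line, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:          # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)             # sequence lines are simply joined
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_defline(header):
    """Split a definition line into its '|'-separated identifier fields and the description."""
    identifier, _, description = header.partition(" ")
    return identifier.split("|"), description


header, sequence = read_fasta(EXAMPLE)[0]
fields, description = split_defline(header)
print(fields)
print(len(sequence), dict(Counter(sequence)))
```

Note that nothing beyond the definition line survives in this format, which is the loss of annotation mentioned above.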


Conversion of Sequence Formats

In sequence analysis and phylogenetic analysis, there is a frequent need to convert between sequence formats. One of the most popular computer programs for sequence format conversion is Readseq, written by Don Gilbert at Indiana University. It recognizes sequences in almost any format and writes a new file in an alternative format. The web interface version of the program can be found at: http://iubio.bio.indiana.edu/cgi-bin/readseq.cgi/.
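To make concrete what such a conversion involves, here is a minimal Python sketch that turns a GenBank flat file into FASTA. It is not how Readseq works internally; it is a simplified illustration that reads only the first DEFINITION line, the VERSION line, and the ORIGIN section, and it assumes a single record. The function name and the abridged sample record are for illustration only.

```python
import re

# Abridged sample: a shortened DEFINITION and only the first two ORIGIN lines of M69051.1.
SAMPLE_GENBANK = """\
LOCUS       HUMGKNASE    2550 bp    mRNA    linear   PRI 29-SEP-1995
DEFINITION  Human liver glucokinase mRNA, complete cds.
VERSION     M69051.1  GI:183226
ORIGIN
        1 aagccctggg ctgccagcct caggcagctc tccatccaag cagccgttgc tgccacaggc
       61 gggccttacg ctccaaggct acagcatgtg ctaggcctca gcaggcagga gcatctctgc
//
"""


def genbank_to_fasta(genbank_text, width=70):
    """Convert one GenBank flat-file record to FASTA text (simplified sketch)."""
    accession, definition, chunks = "", "", []
    in_origin = False
    for line in genbank_text.splitlines():
        if line.startswith("//"):           # end of record
            break
        if in_origin:
            # Drop the position numbers and spaces, keep only the sequence letters.
            chunks.append(re.sub(r"[^A-Za-z]", "", line))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("VERSION"):
            accession = line.split()[1]
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
    sequence = "".join(chunks).upper()
    lines = [f">{accession} {definition}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


print(genbank_to_fasta(SAMPLE_GENBANK))
```

In practice a dedicated tool such as Readseq, or a library routine such as `Bio.SeqIO.convert` in Biopython, is preferable, since real records contain multi-line fields and many special cases that this sketch ignores.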

(II) Secondary Databases

Sequence annotation information in the primary databases is often minimal. To turn the raw sequence information into more sophisticated biological knowledge, much postprocessing of the sequence information is needed. This creates the need for secondary databases, which contain computationally processed sequence information derived from the primary databases. The amount of computational processing varies greatly among the secondary databases; some are simple archives of translated sequence data from identified open reading frames in DNA, whereas others provide additional annotation and higher-level information on structure and function. Examples of secondary databases include TrEMBL and SWISS-PROT.

There are also secondary databases that relate to protein family classification according to function or structure. The Pfam and Blocks databases contain aligned protein sequence information as well as derived motifs and patterns, which can be used for classification of protein families and inference of protein functions. The DALI database is a protein secondary structure database that is vital for protein structure classification and threading analysis to identify distant evolutionary relationships among proteins.

SWISS-PROT: A prominent example of a secondary database is SWISS-PROT (http://www.expasy.ch/), which provides detailed sequence annotation that includes structure, function, and protein family assignment. SWISS-PROT is an annotated protein sequence database that was created at the Department of Medical Biochemistry of the University of Geneva and has been a collaborative effort of the Department and the European Molecular Biology Laboratory (EMBL) since 1987. SWISS-PROT is now an equal partnership between the EMBL and the Swiss Institute of Bioinformatics (SIB). The EMBL activities are carried out by its Hinxton Outstation, the European Bioinformatics Institute (EBI).

The SWISS-PROT protein sequence database consists of sequence entries. Sequence entries are composed of different line types, each with its own format. For standardization purposes, the format of SWISS-PROT follows as closely as possible that of the EMBL Nucleotide Sequence Database. A sample protein sequence record is shown below. The SWISS-PROT database distinguishes itself from other protein sequence databases by three criteria: (i) annotation, (ii) minimal redundancy, and (iii) integration with other databases.

The sequence data are mainly derived from TrEMBL, a database of translated nucleic acid sequences stored in the EMBL database. The annotation of each entry is carefully curated by human experts and thus is of good quality. The protein annotation includes function, domain structure, catalytic sites, cofactor binding, posttranslational modification, metabolic pathway information, disease association, and similarity with other sequences. Much of this information is obtained from the scientific literature and entered by database curators. The annotation adds significant value to each original sequence record. The data record also provides cross-referencing links to other online resources of interest. Very low redundancy and a high level of integration with other primary and secondary databases make SWISS-PROT very popular among biologists.

A recent effort to combine SWISS-PROT, TrEMBL, and PIR led to the creation of the UniProt database, which has larger coverage than any one of the three databases while maintaining the original SWISS-PROT features of low redundancy, cross-references, and high-quality annotation.


Rhodopsin [Homo sapiens]
NCBI Reference Sequence: NP_000530.1

LOCUS       NP_000530    348 aa    linear   PRI 15-MAR-2014
DEFINITION  rhodopsin [Homo sapiens].
ACCESSION   NP_000530
VERSION     NP_000530.1  GI:4506527
DBSOURCE    REFSEQ: accession NM_000539.3
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (residues 1 to 348)
  AUTHORS   Opefi CA, South K, Reynolds CA, Smith SO and Reeves PJ.
  TITLE     Retinitis pigmentosa mutants provide insight into the role of the
            N-terminal cap in rhodopsin folding, structure, and function
  JOURNAL   J. Biol. Chem. 288 (47), 33912-33926 (2013)
   PUBMED   24106275
  REMARK    GeneRIF: Retinitis pigmentosa mutants provide insight into the
            role of the N-terminal cap in rhodopsin folding, structure, and
            function.
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence was derived from AC080007.26.
            This sequence is a reference standard in the RefSeqGene project.
            Summary: Retinitis pigmentosa is an inherited progressive disease
            which is a major cause of blindness in western communities. This
            is the transmembrane protein which, when photoexcited, initiates
            the visual transduction cascade. Defects in this gene are also one
            of the causes of congenital stationary night blindness. [provided
            by RefSeq, Jul 2008].
FEATURES             Location/Qualifiers
     source          1..348
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="3"
                     /map="3q21-q24"
     Protein         1..348
                     /product="rhodopsin"
                     /note="opsin 2, rod pigment; opsin-2"
                     /calculated_mol_wt=38762
     Region          2..37
                     /region_name="Rhodopsin_N"
                     /note="Amino terminal of the G-protein receptor
                     rhodopsin; pfam10413"
                     /db_xref="CDD:150994"
     Site            37..61
                     /site_type="transmembrane region"
     CDS             1..348
                     /gene="RHO"
                     /gene_synonym="CSNBAD1; OPN2; RP4"
                     /coded_by="NM_000539.3:96..1142"
                     /db_xref="CCDS:CCDS3063.1"
                     /db_xref="GeneID:6010"
                     /db_xref="HGNC:10012"
                     /db_xref="HPRD:01584"
                     /db_xref="MIM:180380"
ORIGIN
        1 mngtegpnfy vpfsnatgvv rspfeypqyy laepwqfsml aaymfllivl gfpinfltly
       61 vtvqhkklrt plnyillnla vadlfmvlgg ftstlytslh gyfvfgptgc nlegffatlg
      121 geialwslvv laieryvvvc kpmsnfrfge nhaimgvaft wvmalacaap plagwsryip
      181 eglqcscgid yytlkpevnn esfviymfvv hftipmiiif fcygqlvftv keaaaqqqes
      241 attqkaekev trmviimvia flicwvpyas vafyifthqg snfgpifmti paffaksaai
      301 ynpviyimmn kqfrncmltt iccgknplgd deasatvskt etsqvapa
//

Figure: Protein database record storing information on the rhodopsin protein (NCBI Reference Sequence NP_000530.1)
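The cross-referencing links mentioned in the text appear in the record above as /db_xref qualifiers. The following Python sketch collects the quoted qualifiers of the CDS feature and turns the db_xref values into a small lookup table of external databases. It is a simplified illustration: only qualifiers of the form /key="value" are handled (an unquoted one such as /calculated_mol_wt=38762 would be skipped), and the function names are hypothetical.

```python
import re

# Qualifiers of the CDS feature, copied from the rhodopsin record in the figure.
CDS_QUALIFIERS = '''
/gene="RHO"
/gene_synonym="CSNBAD1; OPN2; RP4"
/coded_by="NM_000539.3:96..1142"
/db_xref="CCDS:CCDS3063.1"
/db_xref="GeneID:6010"
/db_xref="HGNC:10012"
/db_xref="HPRD:01584"
/db_xref="MIM:180380"
'''


def parse_qualifiers(text):
    """Collect /key="value" qualifiers; repeated keys accumulate in a list."""
    qualifiers = {}
    for key, value in re.findall(r'/(\w+)="([^"]*)"', text):
        qualifiers.setdefault(key, []).append(value)
    return qualifiers


def cross_references(qualifiers):
    """Map each external database name to the identifier given in db_xref."""
    return dict(value.split(":", 1) for value in qualifiers.get("db_xref", []))


quals = parse_qualifiers(CDS_QUALIFIERS)
print(cross_references(quals))

# The coded_by qualifier points back to the mRNA region that encodes the protein.
_, span = quals["coded_by"][0].split(":")
start, end = (int(x) for x in span.split(".."))
print((end - start + 1) // 3 - 1)  # codons minus one, assuming the last codon is a stop
```

The last line gives 348, which agrees with the 348 aa length in the LOCUS line, a small consistency check between the protein record and the nucleotide record it refers to.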


TrEMBL: To accommodate the growing influx of protein sequences without compromising the quality of SWISS-PROT, the protein translations of EMBL nucleotide sequences that have not been properly curated by human annotators are put into a supplemental database, TrEMBL (Translated EMBL, http://www.expasy.org/sprot). This database serves as a kind of purgatory (or a “halfway house”) for SWISS-PROT. Each TrEMBL entry is assigned a SWISS-PROT-type accession number that stays with it when the sequence is finally checked manually and accepted into SWISS-PROT. To simplify curation, TrEMBL entries are even formatted in the SWISS-PROT style. However, one should be alert to the fact that TrEMBL entries are generated automatically, so their quality is not guaranteed and their annotations should not be considered as solid as those of authentic SWISS-PROT entries.

PIR: The PIR (Protein Information Resource, http://pir.georgetown.edu) database is an outgrowth of the Protein Sequence Database originally created by Margaret Dayhoff, and is currently maintained at Georgetown University in collaboration with the Munich Information Center for Protein Sequences (MIPS, http://mips.gsf.de/proj/protseqdb) in Munich, Germany, and the Japanese International Protein Information Database. While technically also a curated database, PIR is far less rigorous than SWISS-PROT in maintaining the quality of its annotations. The advantage of PIR, however, is its hierarchical organization. The definitions of protein family and super-family employed in PIR are far narrower than those used in most other protein databases, particularly motif-based and structure-based ones. Thus, PIR super-families are often composed of very similar proteins, which may be treated by other databases as members of the same family. As a result, more distant relations between proteins (the least trivial and therefore the most interesting ones) are often not represented in PIR at all. Recently, PIR has intensified its protein classification efforts with the creation of iProClass (http://pir.georgetown.edu/iproclass), a protein classification database.


(III) Specialized Databases

Specialized databases normally serve a specific research community or focus on a particular organism. The content of these databases may be sequences or other types of information. The sequences in these databases may overlap with a primary database, but may also include new data submitted directly by authors. Because they are often curated by experts in the field, they may have unique organization and additional annotations associated with the sequences. Many taxon-specific genome databases fall within this category. Examples include FlyBase, WormBase, ACeDB, and TAIR. In addition, there are also specialized databases that contain original data derived from functional analysis. For example, the GenBank EST database and the Microarray Gene Expression Database at the European Bioinformatics Institute (EBI) are some of the gene expression databases available.

EcoGene: The EcoGene database provides a set of gene and protein sequences derived from the genome sequence of Escherichia coli K-12. EcoGene is a source of reannotated sequences for the SWISS-PROT and Colibri databases. EcoGene is used for genetic and physical map compilations in collaboration with the Coli Genetic Stock Center. The EcoGene12 release includes 4293 genes. A literature survey identified 717 proteins whose N-terminal amino acids have been verified by sequencing. Users can search and retrieve individual EcoGene pages, or they can download large datasets for incorporation into database management systems, facilitating various genome-scale computational and functional analyses.

Saccharomyces Genome Database (SGD): The Saccharomyces Genome Database (SGD) provides comprehensive integrated biological information for the budding yeast Saccharomyces cerevisiae, along with search and analysis tools to explore these data, enabling the discovery of functional relationships between sequence and gene products in fungi and higher organisms. Researchers studying larger organisms, including models such as Drosophila and Caenorhabditis, as well as plants and humans, represent growing communities that look to SGD for information when their research leads to genes with similarity to one of the many that are already well characterized in yeast. Educators and students in genetics and cellular biology comprise another large community that SGD serves, as do bioinformatics scientists who perform genome-wide computational analyses, for either yeast or comparative studies.

ACeDB: ACeDB is a genome database system started in 1989 by Jean Thierry-Mieg (CNRS, Montpellier) and Richard Durbin (Sanger Institute). It was originally developed for the Caenorhabditis elegans genome project, from which its name was derived: A C. elegans DataBase. However, its tools have been generalized to be much more flexible, and the same software is now used for many different genomic databases, from bacteria to fungi to plants to man.

Arabidopsis Information Resource (TAIR): The Arabidopsis Information Resource (TAIR) collects information and maintains a database of genetic and molecular biology data for Arabidopsis thaliana, a widely used model plant. TAIR is managed by the nonprofit Phoenix Bioinformatics Corporation and is supported through institutional, lab, and personal subscriptions. Prior funding was provided by the National Science Foundation. The data in TAIR can be searched and viewed using the GBrowse or interactive SeqViewer genome browsers.

FlyBase: FlyBase is an online bioinformatics database and the primary repository of genetic and molecular data for the extensively studied model organism Drosophila melanogaster. A wide range of data are presented in different formats. Information in FlyBase originates from a variety of sources ranging from large-scale genome projects to the primary research literature. These data types include mutant phenotypes, molecular characterization of mutant alleles and other deviations, cytological maps, wild-type expression patterns, anatomical images, transgenic constructs and insertions, sequence-level gene models, and molecular classification of gene product functions.
Query tools allow navigation of FlyBase through DNA or protein sequence, by gene or mutant name, or through functional, phenotypic, and anatomical data. The database offers several different query tools in order to provide efficient access to the available data and to facilitate the discovery of significant relationships within the database. The FlyBase project is carried out by a consortium of Drosophila researchers and computer scientists at Harvard University and Indiana University in the United States, and the University of Cambridge in the United Kingdom.

Gramene: Gramene (http://www.gramene.org/) is a curated, open-source, integrated data resource for comparative functional genomics in crops and model plant species. The Gramene database has become a resource for major model and crop plants including Arabidopsis, Brachypodium, maize, sorghum, poplar, and grape, in addition to several species of rice. Gramene began with the addition of an Ensembl genome browser and has expanded in the last decade to become a robust resource for plant genomics, hosting a wide array of data sets including quantitative trait loci (QTL), metabolic pathways, genetic diversity, genes, proteins, germplasm, literature, ontologies, and a fully structured markers and sequences database integrated with genome browsers and maps from various published studies (genetic, physical, bin, etc.). In addition, Gramene now hosts a variety of web services including a Distributed Annotation Server (DAS), BLAST, and a public MySQL database. Twice a year, Gramene releases a major build of the database, and it makes interim releases to correct errors or to make important updates to software and/or data. Gramene currently hosts annotated whole genomes for over two dozen plant species and partial assemblies for almost a dozen wild rice species.

Online Mendelian Inheritance in Man (OMIM): Online Mendelian Inheritance in Man (OMIM) is a timely, authoritative compendium of bibliographic material and observations on inherited disorders and human genes. It is continuously updated. Curation of the database and editorial decisions take place at The Johns Hopkins University School of Medicine. OMIM provides authoritative free-text overviews of genetic disorders and gene loci that can be used by clinicians, researchers, students, and educators. In addition, OMIM has many rich connections to relevant primary data resources such as bibliographic, sequence, and map information.


Interconnection between Biological Databases

As mentioned, primary databases are central repositories and distributors of raw sequence and structure information. They support nearly all other types of biological databases. Therefore, in the biological community, there is a frequent need for the secondary and specialized databases to connect to the primary databases and to keep uploading sequence information. In addition, a user often needs to get information from both primary and secondary databases to complete a task because the information in a single database is often insufficient. Instead of letting users visit multiple databases, it is convenient for entries in a database to be cross-referenced and linked to related entries in other databases that contain additional information. All of this creates a demand for linking different databases.

The main barrier to linking different biological databases is format incompatibility: current biological databases utilize all three types of database structures (flat files, relational, and object oriented). The heterogeneous database structures limit communication between databases. One solution to networking the databases is to use a specification called Common Object Request Broker Architecture (CORBA), which allows database programs at different locations to communicate in a network through an “interface broker” without having to understand each other’s database structure. It works in a way similar to Hyper Text Markup Language (HTML) for web pages, labeling database entries using a set of common tags. A similar protocol called eXtensible Markup Language (XML) also helps in bridging databases. In this format, each biological record is broken down into small, basic components that are labeled with a hierarchical nesting of tags. This structure significantly improves the distribution and exchange of complex sequence annotations between databases.

Recently, a specialized protocol for bioinformatics data exchange has been developed. It is the distributed annotation system, which allows one computer to contact multiple servers, retrieve dispersed annotation information related to a particular sequence, and integrate the results into a single combined report.
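The idea of breaking a record into small components labeled with nested tags can be illustrated with Python's standard XML library. The tag names used here (record, accession, organism, feature, location, product) are hypothetical and do not follow the schema of any real database; the values are taken from the glucokinase record shown earlier in this chapter.

```python
import xml.etree.ElementTree as ET


def build_record():
    """Build a small record as a hierarchy of tagged components."""
    record = ET.Element("record")
    ET.SubElement(record, "accession").text = "M69051"
    ET.SubElement(record, "organism").text = "Homo sapiens"
    # A feature is itself broken down into smaller labeled parts.
    feature = ET.SubElement(record, "feature", {"type": "CDS"})
    ET.SubElement(feature, "location").text = "286..1680"
    ET.SubElement(feature, "product").text = "glucokinase"
    return record


# The sending database serializes the record to plain text.
xml_text = ET.tostring(build_record(), encoding="unicode")
print(xml_text)

# The receiving database reads fields by tag name, without knowing how the
# sender stores its data internally.
parsed = ET.fromstring(xml_text)
print(parsed.findtext("accession"), parsed.findtext("feature/product"))
```

Because every component carries its own label, a flat-file, relational, or object-oriented database can each map the tags onto its own structure.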


Chapter 5 Database retrieval system Databases are fundamental to modern biological research, especially to genomic studies. The goal of a biological database is twofold: information retrieval and knowledge discovery. Entrez: The Entrez (http://www.ncbi.nlm.nih.gov/) is a powerful federated search engine, or web portal that allows users to search for scientific information, DNA, RNA and protein sequences, structures, and bibliographic references. It is a part of the National Library of Medicine (NLM), which is itself a department of the National Institutes of Health (NIH), which in turn is a part of the United States Department of Health and Human Services. The name "Entrez" (a greeting meaning "Come in!" in French) was chosen to reflect the spirit of welcoming the public to search the content available from the NLM. Entrez Global Query is an integrated search and retrieval system that provides access to all databases simultaneously with a single query string and user interface. Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps. Some textbooks are also available online through the Entrez system. The databases accessible through Entrez are among the most integrated databases. Effective information retrieval involves the use of Boolean operators (AND, OR, NOT). Entrez has additional user-friendly features to help conduct complex searches. One such option is to use Limits, Preview/Index, and History to narrow down the search space. Alternatively, one can use NCBI-specific field qualifiers to conduct searches. To retrieve sequence information from NCBI GenBank, an understanding of the format of GenBank sequence files is necessary. It is also important to bear in mind that sequence data in these databases are less than perfect. There are sequence and annotation errors. Biological databases are also plagued by redundancy problems. 
There are various solutions to correct annotation and reduce redundancy, for example, merging redundant sequences into a single entry or storing highly redundant sequences separately.

Sequence retrieval system

Sequence Retrieval System (SRS; available at http://srs6.ebi.ac.uk/) is a retrieval system maintained by the EBI, which is comparable to NCBI Entrez. It is not as integrated as Entrez, but it allows the user to query multiple databases simultaneously, which makes it another good example of database integration. It also offers direct access to certain sequence analysis applications such as sequence similarity searching and Clustal sequence alignment. Queries can be launched using "Quick Text Search", which has only one query box in which to enter information. There are also more elaborate submission forms, the "Standard Query Form" and the "Extended Query Form". The standard form allows four criteria (fields) to be used, which are linked by Boolean operators. The extended form allows many more diversified criteria and fields to be used. The search results contain the query sequence and sequence annotation as well as links to literature, metabolic pathways, and other biological databases.
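One of the remedies for redundancy named in this chapter, merging redundant sequences into a single entry, can be sketched in a few lines of Python. This is only an illustration under simple assumptions: two records count as redundant when their sequences are identical (ignoring letter case), and the record identifiers (entryA and so on) are made up for the example. Real databases use more elaborate criteria.

```python
def merge_redundant(records):
    """Collapse records that share an identical sequence into one entry.

    records is a list of (identifier, sequence) pairs. The identifiers of
    merged records are joined with '|' so that no name is lost.
    """
    merged = {}  # sequence -> list of identifiers, in order of first appearance
    for identifier, sequence in records:
        merged.setdefault(sequence.upper(), []).append(identifier)
    return [("|".join(ids), seq) for seq, ids in merged.items()]


# Hypothetical records: entryA and entryB hold the same sequence.
records = [("entryA", "ATGGCC"), ("entryB", "atggcc"), ("entryC", "ATGTTT")]
for identifier, sequence in merge_redundant(records):
    print(identifier, sequence)
```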


Chapter 6 Cataloging biological databases

Primary nucleotide sequence database

The Primary Nucleotide Sequence Database consists of the following databases.
• DNA Data Bank of Japan (National Institute of Genetics)
• European Nucleotide Archive (European Bioinformatics Institute)
• GenBank (National Center for Biotechnology Information)
The three databases, DDBJ (Japan), GenBank (USA) and European Nucleotide Archive (Europe), are repositories for nucleotide sequence data from all organisms. All three databases accept nucleotide sequence submissions, and then exchange new and updated data on a daily basis to achieve optimal synchronization between them. These three databases are primary databases, as they house original sequence data.

Meta database: These databases of databases collect data from different sources and make them available in a new and more convenient form, or with an emphasis on a particular disease or organism.
• BioGraph - A knowledge discovery service based on the integration of more than 20 heterogeneous databases
• Bioinformatic Harvester - Integrates 26 major protein/gene resources
• Neuroscience Information Framework (University of California San Diego) - Integrates hundreds of neuroscience-relevant resources
• Entrez (National Center for Biotechnology Information)
• Enzyme Portal (European Bioinformatics Institute) - Integrates enzyme information such as small-molecule chemistry, biochemical pathways and drug compounds
• MetaBase (KOBIC) - A user-contributed database of biological databases
• PathogenPortal - A repository linking to the Bioinformatics Resource Centers (BRCs) sponsored by the National Institute of Allergy and Infectious Diseases (NIAID)
• SOURCE (Stanford University) - Encapsulates the genetics and molecular biology of genes from the genomes of Homo sapiens, Mus musculus, and Rattus norvegicus into easy-to-navigate GeneReports

Genome database: These databases collect organism genome sequences, annotate and analyze them, and provide public access. These databases may hold the genomes of many species, or the genome of a single model organism.
• EcoCyc is a genome database that describes the genome and the biochemical machinery of the model organism E. coli K-12
• SGD is a database that describes the genome and the biochemical and molecular machinery of budding yeast (Saccharomyces cerevisiae)
• ACeDB is a database system for a nematode (Caenorhabditis elegans)
• TAIR is a genome database system for a widely used model plant, Arabidopsis thaliana
• FlyBase is an online bioinformatics database and the primary repository of genetic and molecular data for the extensively studied model organism Drosophila melanogaster
• Gramene (http://www.gramene.org/) is a curated, open-source, integrated data resource for comparative functional genomics in crops and model plant species
• OMIM (Online Mendelian Inheritance in Man) is a database on inherited disorders and human genes
• CAMERA is a resource for microbial genomics and metagenomics
• Corn is the Maize Genetics and Genomics Database
• PATRIC is the PathoSystems Resource Integration Center
• RegulonDB is a model of the complex regulation of transcription initiation, or regulatory network, of the cell E. coli K-12


Protein sequence databases:
• UniProt: Universal Protein Resource (EBI, Swiss Institute of Bioinformatics, PIR)
• Protein Information Resource (Georgetown University Medical Center (GUMC))
• Swiss-Prot: Protein Knowledgebase (Swiss Institute of Bioinformatics)
• PEDANT: Protein Extraction, Description and ANalysis Tool
• PROSITE: Database of Protein Families and Domains
• Database of Interacting Proteins (Univ. of California)
• Pfam: Protein families database of alignments and HMMs (Sanger Institute)
• PRINTS: a compendium of protein fingerprints (Manchester University)
• SUPERFAMILY: a database of (super-family and family) annotations for all completely sequenced organisms
• InterPro: Classifies proteins into families and predicts the presence of domains and sites

Proteomics database:
• Proteomics Identifications Database (PRIDE) (European Bioinformatics Institute) - A public repository for proteomics data, containing protein and peptide identifications and their associated supporting evidence as well as details of post-translational modifications
• MitoMiner (MRC Mitochondrial Biology Unit) - A mitochondrial proteomics database integrating large-scale experimental datasets from mass spectrometry and GFP studies for 12 species

Protein structure databases:
• Protein Data Bank (PDB), comprising:
  • Protein Data Bank in Europe (PDBe)
  • Protein Data Bank in Japan (PDBj)
  • Research Collaboratory for Structural Bioinformatics (RCSB)

Secondary databases:
• SCOP (Structural Classification of Proteins)
• CATH Protein Structure Classification
• PDBsum

Protein model databases:
• Swiss-model: Server and Repository for Protein Structure Models
• ModBase: Database of Comparative Protein Structure Models (Sali Lab, UCSF)
• Protein Model Portal (PMP): Meta database that combines several databases of protein structure models (Biozentrum, Basel, Switzerland)

RNA databases:
• Rfam, a database of RNA families
• miRBase, the microRNA database
• snoRNAdb, a database of snoRNAs
• lncRNAdb, a database of lncRNAs
• piRNAbank, a database of piRNAs
• GtRNAdb, a database of genomic tRNAs
• SILVA, a database of ribosomal RNAs
• RDP, the Ribosomal Database Project

Carbohydrate structure databases:
• EuroCarbDB, a repository for both carbohydrate sequences/structures and experimental data

Protein-protein interactions:
• BIND: Biomolecular Interaction Network Database
• BioGRID: a General Repository for Interaction Datasets (Samuel Lunenfeld Research Institute)
• CCSB Interactome
• DIP: Database of Interacting Proteins
• IntAct molecular interaction database: a central, standards-compliant repository of molecular interactions, including protein-protein, protein-small molecule and protein-nucleic acid interactions
• NetPro
• STRING: a database of known and predicted protein-protein interactions (EMBL)
• MINT: Molecular INTeraction database

Metabolic pathway databases:
• Small Molecule Pathway Database (SMPDB)
• BioCyc Database Collection, including EcoCyc and MetaCyc
• KEGG PATHWAY Database (Univ. of Kyoto)
• MANET database (University of Illinois)

Microarray databases:
• ArrayExpress (European Bioinformatics Institute)
• Gene Expression Omnibus (National Center for Biotechnology Information)
• GPX (Scottish Centre for Genomic Technology and Informatics)
• Stanford Microarray Database (SMD) (Stanford University)
• Genevestigator - Expression Search Engine (Nebion AG)

PCR and quantitative PCR primer databases:
• PathoOligoDB: a free qPCR oligo database for pathogens
• RTPrimerDB: a public primers and probes database for real-time PCR reactions

Taxonomic databases:
• Catalogue of Life source databases
• Encyclopedia of Life
• Integrated Taxonomic Information System
• EzTaxon-e, a database for the identification of prokaryotes based on 16S ribosomal RNA gene sequences


Chapter 7 Pairwise Sequence Alignment

In this document we illustrate how to perform pairwise sequence alignments using the Biostrings package through the use of the pairwiseAlignment function. This function aligns a set of pattern strings to a subject string in a global, local, or overlap (ends-free) fashion, with or without affine gaps, using either a fixed or quality-based substitution scoring scheme. Each of these pairwise sequence alignment problems is solved by maximizing the alignment score. An alignment score is determined by the type of pairwise sequence alignment (global, local, overlap), which sets the ranges for the substrings; the substitution scoring scheme, which sets the distance between aligned characters; and the gap penalties, which are divided into opening and extension components. The optimal pairwise sequence alignment is the pairwise sequence alignment with the largest score for the specified alignment type, substitution scoring scheme, and gap penalties. There are 3 methods for pairwise sequence alignment: 1) dot plot, 2) global alignment, and 3) local alignment.

Dot Plot

The simplest method is the dot plot. One sequence is written out horizontally, and the other sequence is written out vertically, along the top and side of an m x n grid, where m and n are the lengths of the two sequences. A dot is placed in a cell in the grid wherever the two sequences match. A diagonal line in the grid visually shows where the two sequences have sequence identity. Web-based dot plot implementations can be found here:
http://www.vivo.colostate.edu/molkit/dnadot/ - for nucleotide sequences only
http://emboss.bioinformatics.nl/cgi-bin/emboss/dotmatcher - for both nucleic acid and protein sequences, with standard EMBOSS scoring matrices
http://www.changbioscience.com/res/resd.html - for any text string
Stand-alone dot plot programs operable via either a GUI or the command line can be found in EMBOSS (Jemboss is the Java GUI).
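The dot plot described above is simple enough to sketch directly. The following Python snippet is a minimal illustration, not a replacement for the tools listed: it marks every exact single-character match with "*" and applies no window or threshold filtering, which the real programs usually add. The two short sequences are made up for the example.

```python
def dot_plot(seq_x, seq_y):
    """Return the dot matrix as a list of strings.

    Each row corresponds to one character of seq_y (written vertically),
    each column to one character of seq_x (written horizontally).
    """
    return ["".join("*" if a == b else "." for a in seq_x) for b in seq_y]


horizontal = "GATTACA"
vertical = "GATCACA"
print("  " + horizontal)
for residue, row in zip(vertical, dot_plot(horizontal, vertical)):
    print(residue + " " + row)
```

Regions of identity show up as runs of "*" along a diagonal; the single mismatch in the middle of the example interrupts the main diagonal.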

Global Alignment: The algorithm published by Needleman and Wunsch in 1970 for alignment of two protein sequences was the first application of dynamic programming to biological sequence analysis. The Needleman-Wunsch algorithm finds the best-scoring global alignment between two sequences. Global alignments are most useful when the two sequences being compared are of similar lengths, and not too divergent.
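As a rough sketch of the dynamic programming idea behind Needleman-Wunsch, the Python function below fills the score matrix and returns only the best global alignment score (the traceback that recovers the alignment itself is omitted for brevity). The scoring values (+1 match, -1 mismatch, -2 per gap position) are illustrative assumptions, not values prescribed by the algorithm.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # a[i-1] aligned to a gap
            left = score[i][j - 1] + gap    # b[j-1] aligned to a gap
            score[i][j] = max(diagonal, up, left)
    # The global score is read from the bottom-right cell.
    return score[-1][-1]


print(needleman_wunsch_score("GATTACA", "GATCACA"))
```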

Local Alignment: Real life is often complicated, and we observe that genes, and the proteins they encode, have undergone exon-shuffling, recombination, insertions, deletions, and even fusions. Many proteins exhibit modular architecture. In searching databases for similar sequences, it is useful to find sequences that have similar domains or functional motifs. Smith & Waterman (1981) published an application of dynamic programming to find optimal local alignments. The algorithm is similar to Needleman-Wunsch, but negative cell values are reset to zero, and the traceback procedure starts from the highest-scoring cell and stops when a cell with a score of zero is reached.
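The two differences from Needleman-Wunsch named above (negative cells reset to zero, traceback from the highest-scoring cell) can be seen in the following Python sketch. It is a minimal illustration with a linear gap penalty; the scoring values (+2, -1, -2) and the test sequences are assumptions chosen for the example.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Negative values are reset to zero: a local alignment may start anywhere.
            score[i][j] = max(0, diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback starts at the highest-scoring cell and stops at a zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diagonal:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# The shared motif GATTACA is embedded in otherwise unrelated flanks.
print(smith_waterman("TTTGATTACATTT", "CCCGATTACACCC"))
```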

Scoring Matrices

The Needleman-Wunsch and Smith-Waterman algorithms require a scoring matrix. The scoring matrix assigns a positive score for a match, and a penalty for a mismatch. For nucleotide sequence alignments, the simplest scoring matrix awards +1 for a match, and -1 for a mismatch. The blastn algorithm at NCBI scores +5 for a match and -4 for a mismatch. These scoring matrices treat all mutations (mismatches) equally. In reality, transitions (pyrimidine -> pyrimidine and purine -> purine) occur much more frequently than transversions (pyrimidine -> purine and vice versa). For aligning non-protein-coding DNA sequences, a transition/transversion scoring matrix may be more appropriate. For aligning DNA sequences that encode proteins, alignment of the protein amino acid sequences will almost always be more reliable.

For protein sequence alignments, the scoring matrices are more complicated. The goal is to reflect evolutionary processes. Some amino acid changes can arise from a single nucleotide change, whereas other amino acid changes require two nucleotide changes. Some amino acid changes are less likely to affect protein structure or function than others. Dayhoff used alignments of highly conserved proteins to assess which amino acid changes were likely to be accepted (Point Accepted Mutations, PAM). From these data she devised a 20 x 20 amino acid substitution matrix for PAM-1, a unit of evolutionary change resulting in 1 accepted mutation per 100 amino acids. From there she calculated other matrices such as PAM-2, PAM-30 or PAM-250, where the PAM-n matrix is derived by multiplying the PAM-1 matrix by itself n times. The BLOSUM matrices (BLOcks SUbstitution Matrix) derive their amino acid substitution frequencies from the Blocks database of un-gapped local multiple sequence alignments. BLOSUM62 is calculated from sequences with 62% identity or less; BLOSUM80 from sequences with 80% identity or less.
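A transition/transversion scoring matrix of the kind mentioned above can be built with a few lines of Python. The numerical values used here (+1 for a match, -1 for a transition, -2 for a transversion) are illustrative assumptions only; the point is that a transversion is penalized more heavily than a transition.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of nucleotides."""
    if x == y:
        return match
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


# Full 4 x 4 matrix as a dictionary keyed by nucleotide pairs.
matrix = {(x, y): substitution_score(x, y) for x in "ACGT" for y in "ACGT"}

print("  ", list("ACGT"))
for x in "ACGT":
    print(x, [matrix[(x, y)] for y in "ACGT"])
```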

Gap penalty

Sequence alignments usually require insertion of gaps, reflecting insertion or deletion mutations. If a nucleotide or amino acid in one sequence is aligned to a gap in the target sequence, then this should be penalized like a mismatch. However, gaps at the ends of sequences should perhaps not incur any penalty. Moreover, a single insertion or deletion mutation could result in a contiguous gap of multiple residues. Therefore, a single gap that is 3 residues long should incur a smaller penalty than 3 separate gaps of one residue each. This is why gap penalties are usually divided into an opening component and an extension component.
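The reasoning above can be checked with a small calculation. The sketch below assumes one common convention for affine gap penalties, in which a gap of length L costs the opening penalty plus L times the extension penalty (other programs charge the extension only for residues after the first). The values 10 and 1 are illustrative assumptions.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Cost of one contiguous gap of the given length (0 if there is no gap)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


one_long_gap = affine_gap_penalty(3)          # one gap, three residues long
three_short_gaps = 3 * affine_gap_penalty(1)  # three separate one-residue gaps
print(one_long_gap, three_short_gaps)
```

With these values the single three-residue gap is much cheaper than three separate gaps, which matches the biological argument that one mutation event can create a gap of several residues.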


Chapter 8 Multiple sequence alignment

Multiple Sequence Alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences is assumed to have an evolutionary relationship by which the sequences share a lineage and are descended from a common ancestor. From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins. Visual depictions of the alignment illustrate mutation events such as point mutations (single amino acid or nucleotide changes), which appear as differing characters in a single alignment column, and insertion or deletion mutations (indels or gaps), which appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.

Multiple sequence alignment also refers to the process of aligning such a sequence set. Because three or more sequences of biologically relevant length can be difficult and are almost always time-consuming to align by hand, computational algorithms are used to produce and analyze the alignments. MSAs require more sophisticated methodologies than pairwise alignment because they are more computationally complex. Most multiple sequence alignment programs use heuristic methods rather than global optimization because identifying the optimal alignment between more than a few sequences of moderate length is prohibitively computationally expensive.
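Once an alignment exists, assessing conservation column by column is straightforward. The following Python sketch assumes the alignment is already given as equal-length strings with "-" for gaps (the three short sequences are made up) and reports, for every column, the most common character and the fraction of sequences that carry it. It does not compute the alignment itself.

```python
from collections import Counter


def column_conservation(alignment):
    """Return a (most common character, fraction) pair for each alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        result.append((residue, count / len(column)))
    return result


# Hypothetical alignment: column 3 holds a point mutation, column 5 an indel.
alignment = ["ACGT-A", "ACGTTA", "ACCT-A"]
for position, (residue, fraction) in enumerate(column_conservation(alignment), start=1):
    print(position, residue, round(fraction, 2))
```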


Chapter 9 Practical Exercises

Exercise 1: Searching the scientific literature and sequences

Theory: The most fundamental skill in bioinformatics is the ability to carry out an efficient and comprehensive search of the scientific literature to find out what is known about a specific subject. All of you are familiar with web search engines, and while they can be useful, they also turn up many items that have never undergone the test of scientific peer review. Thus, this exercise is NOT a search of the World Wide Web, but will introduce you to searching the published scientific literature using a database such as MEDLINE, Biological Abstracts or Chemical Abstracts. This exercise will focus on the 'Entrez browser' entry to the National Library of Medicine database MEDLINE (PubMed). PubMed is a database service of the National Library of Medicine that cites articles from MEDLINE and life science journals.

Procedure:

1. To browse the World Wide Web, open your favourite internet browser (Internet Explorer, Google Chrome or Mozilla Firefox).

2. In the address bar, type the URL (http://www.ncbi.nlm.nih.gov/pubmed) and press the 'Enter' key on your keyboard. The home page of the site (here PubMed) will appear. A search window and a text box will be displayed where you will type a few key words relevant to your search topic. To search scientific or bibliographic literature in PubMed, type key word(s) or phrase(s) into the query box (e.g., a subject, author and/or journal).

3. Type your key words and click the 'Search' button. If necessary, combine search terms with connector words: "AND," "OR," or "NOT" using upper case letters. PubMed offers alternative searching options: the Auto Suggest drop-down menu appears when entering words, and the Titles with your search terms option may appear after a search. PubMed displays a list of Results in Summary format after clicking on the 'Search' button. To retrieve more information about a citation(s), use the Display Settings link to change how the results are formatted, sorted and displayed. Filters are available in the left navigation bar and may be used to limit or focus searches. Click on a term to activate or deactivate the filter. Multiple filters may be selected. The Filters activated message appears above the search results list and these limits remain in effect until removed or cleared. To reveal additional filter options, click the Choose additional filters or more links. Check the desired selections, then click the Show button.

4. For any entry in the Results list, click associated author names. Search details, located in the right navigation column, provide information on how PubMed ran a search. PubMed looks first for the entire word or phrase as a MeSH term, then for journal titles, then authors. PubMed also searches “All Fields” for the term. Search details shows how PubMed maps terms to MeSH headings and subheadings. Changes to the search may be made in the Details box; click Search to run the updated search strategy

5. Save what you like to your hard drive by choosing your browser’s File: Save as option.


Exercise 2: Characterization of a Known Gene

URL: https://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi

Theory: In this exercise, you will use 'Entrez' to find entries for the coding sequence of a gene of interest. You will use glucokinase as an initial example (glucokinase is the enzyme that catalyzes the initial step of glycolysis in liver and several other cell types).

Procedure:

1. To browse the World Wide Web, open your favourite internet browser (Internet Explorer, Google Chrome or Mozilla Firefox).

2. In the address bar, type the URL (https://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi) and press the 'Enter' key on your keyboard. The home page of the site will appear.

3. In the left column, click the 'Nucleotide' button. A search window and a text box will be displayed where you will enter your search term.

4. In the top search box type in 'glucokinase' (without the quotes) and click on the Go button. You will get about 1000 entries listed on more than 50 pages of 20 entries each. This is an unwieldy number, so you will have to figure out a way to narrow your search. There are two general ways to narrow a search: the use of the Limits menu within Entrez, or the use of Boolean operators (AND, OR, NOT). The present search will pick up all entries in the database that have the word glucokinase ANYWHERE in the entry (e.g. an entry that contains a line stating "Gene X has nothing to do with glucokinase" will come up as a hit in this search). You can eliminate some entries by adding NOT similar NOT hypothetical after glucokinase in the search box. This will eliminate entries listed only because they are noted to be similar to glucokinase. Additional filters can be applied to the search by using the Limits tab just below the search box.

5. Click on the Limits tab. If you are interested only in the coding regions of glucokinase genes (i.e. DNA sequences obtained from mRNA for glucokinase), you can eliminate genomic sequences with their large introns.

6. In the "Molecule" pull-down menu select mRNA and click on the "GO" button.


Note how many hits are now listed. You still have entries that are not glucokinase. To further narrow your search, click on the Limits tab one more time. In the top left drop-down menu change from All Fields to Title. This will limit the search to those entries that have glucokinase in their title line. Still, you will note that your entries include not only glucokinase but also glucokinase regulatory proteins and other entries that have the term glucokinase in the title.
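The narrowing steps of this exercise can also be written as a single Entrez query string. The Python sketch below only builds the query and a request URL; it does not contact NCBI. It assumes the E-utilities `esearch` endpoint, the database name `nuccore` for Nucleotide, the `[Title]` field qualifier and the `biomol_mrna[PROP]` filter as the programmatic counterparts of the Limits settings used above; the helper names `build_term` and `esearch_url` are made up for this illustration, and the current NCBI documentation should be consulted before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(keyword, exclude=(), title_only=False, mrna_only=False):
    """Compose an Entrez query from a keyword, NOT terms and optional limits."""
    term = f"{keyword}[Title]" if title_only else keyword
    for word in exclude:
        term += f" NOT {word}"
    if mrna_only:
        term += " AND biomol_mrna[PROP]"
    return term


def esearch_url(db, term, retmax=20):
    """Return the request URL for a search; nothing is sent over the network."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = build_term("glucokinase", exclude=("similar", "hypothetical"),
                  title_only=True, mrna_only=True)
print(term)
print(esearch_url("nuccore", term))
```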

Result:
• Clicking on the accession number for one of your entries will bring up the full nucleotide sequence information. Most of the information in an entry is self-explanatory, but if you scroll down to the Features entry you should find a CDS entry. This specifies the part of the nucleotide sequence below that actually codes for a protein (often you will find untranslated regions at both the 3' and 5' ends of a sequence). In addition, the translated sequence is given in the one-letter amino acid shorthand just above the full nucleotide sequence.
• To obtain the sequence in a form which can be analyzed by a variety of gene analysis software, select FASTA from the Display pull-down menu. The browser will give you a page which has the sequence without any line numbers or breaks. Save the sequence by selecting the material beginning with the > and going up to the last nucleotide (be sure to avoid the line above the > and below the last nucleotide) and copying this to a word processor program. The > line is recognized as a description line, not as sequence, by most analysis software. You can change the font to Courier 10 point to obtain the proper spacing and lines.

OBTAINING A PROTEIN FASTA ENTRY
• To compare protein sequences, you will want to obtain the protein FASTA output.
• To obtain this, change the Display menu back to the GenBank display and scroll down until you reach the CDS information. Click on the link in the line that begins /protein_id= "xxx1234" (i.e. whatever the assigned protein id number is).
• This will change the display to GenPept and bring up a page which shows some of the same information, but is limited to the amino acid sequence. In this page, change the Display menu to FASTA to obtain an output similar to the nucleotide FASTA output (an index line which begins with > and an amino acid sequence). You can copy the index line and sequence to a word processor for use later (once you are in the word processor, again change the text to Courier 10 pt to retain line spacing).
• SAVE THE PROTEIN FASTA OUTPUTS (glucokinase from mammalian species of your choice) to a word processor program. You will compare the sequences of these proteins in a future exercise.
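Because a FASTA entry is just an index line starting with > followed by sequence lines, saved entries are easy to read back by program. The Python sketch below is a minimal reader under that assumption; the two records in the example are invented and are not real glucokinase sequences.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new index line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without breaks
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical saved output containing two short protein entries.
example = """>seq1 example protein
MLDD
KAR

>seq2
MKT
"""
for description, sequence in parse_fasta(example):
    print(description, len(sequence), sequence)
```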


Exercise 3: Finding out open reading frames (ORF) through NCBI ORF finder URL: http://www.ncbi.nlm.nih.gov/gorf/gorf.html

Theory: Open reading frames (ORFs) are regions of DNA that can encode a protein. These DNA sequences are first transcribed into mRNA and then translated into protein. By examining the sequence alone, you can determine the sequence of amino acids that will appear in the final protein. In translation, each codon of three nucleotides determines which amino acid will be added next to the growing protein chain. It is therefore important to decide at which nucleotide translation starts and where it stops, which is why the correct open reading frame must be determined. Every sequence can be read in six reading frames: three in the forward direction (1, 2, 3) and three in the reverse direction (-1, -2, -3). The reading frame that is used determines which amino acids will be encoded by a gene. Typically only one reading frame is used in translating a gene (in eukaryotes), and this is often the largest ORF. Once the ORF is known, the DNA sequence can be translated into the corresponding amino acid sequence. An ORF starts with an ATG (AUG in mRNA, coding for methionine) in most species and ends in a stop codon (TAA, TAG or TGA; UAA, UAG or UGA in mRNA), indicated by * in the protein sequence.

Procedure:

1. To browse the World Wide Web, open your internet browser (Internet Explorer, Google Chrome, Mozilla Firefox, etc.).

2. In the address bar, type www.ncbi.nlm.nih.gov/gorf/gorf.html and press the 'Enter' key on your keyboard or click the Go button. Here one can see a text field to enter the GI or accession number of the query sequence, a text box to enter the query sequence in FASTA format, and a button to run the ORF finder.

3. Type the nucleotide sequence in the box provided (in FASTA format), or copy your nucleotide sequence from a .txt file or word processor file and paste it into the input box. FASTA is the simplest sequence format: it starts with a '>' symbol followed by the sequence ID and other comments, with the sequence itself on the following lines. There is a drop-down menu to select a genetic code. It contains 20 different codon dictionaries that hold the codons for different organisms and organelles. The first one, 'Standard', is the default; select it. (For example, in the standard code only AUG codes for methionine, but in the Vertebrate Mitochondrial Code and the Yeast Mitochondrial Code, AUA also codes for methionine.)

4. Now click the ORF finder button to get the result. The result shows all six possible reading frames of the entered query sequence. The ORFs are listed according to their size, next to a graphical representation of the sequence.

5. Click on the green region which represents an ORF in the sequence to see that ORF. Once you click, it will turn purple, indicating that the particular ORF is selected. The selected ORF is also indicated in the list, together with its length and location.

One can see the sequence of the selected ORF which actually codes for the protein. The user can find the start codon, the stop codon and the total number of amino acids from the sequence. Now click on the Accept button. You can also perform a BLAST search for the particular ORF that you selected. Select the appropriate program and database, then click on the BLAST button.
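What the ORF finder does in this exercise can be approximated by a short Python sketch. It scans all six reading frames for stretches that begin with ATG and end at the first in-frame stop codon, and lists them longest first. This is a simplified illustration assuming the standard genetic code and plain A/C/G/T input; it is not the algorithm of the NCBI tool, and the function names and the test sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement, i.e. the opposite strand read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=2):
    """Return (frame, start, end, orf) tuples, longest ORF first.

    Frames are numbered 1, 2, 3 on the given strand and -1, -2, -3 on the
    reverse complement. start and end are 0-based positions on the strand
    that was read; min_codons counts the codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            start = None
            for pos in range(offset, len(seq) - 2, 3):
                codon = seq[pos:pos + 3]
                if start is None and codon == "ATG":
                    start = pos  # first ATG opens the reading frame
                elif start is not None and codon in STOP_CODONS:
                    if (pos - start) // 3 >= min_codons:
                        orfs.append((strand * (offset + 1), start, pos + 3, seq[start:pos + 3]))
                    start = None  # look for the next ATG after this stop
    return sorted(orfs, key=lambda orf: len(orf[3]), reverse=True)


print(find_orfs("AAATGGCCTAAGG"))
```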


Exercise 4: Translating an unknown DNA Sequence

URL: http://web.expasy.org/translate/

Theory: One of the most basic exercises in bioinformatics is determining if a nucleic acid sequence actually codes for a protein. This is complicated by the fact that you generally do not know which strand is the coding strand (i.e. whether the sequence itself or its complementary strand will be transcribed into mRNA) nor the correct reading frame (whether the sequence should be read three bases at a time starting with the first nucleotide, the second or the third). Both these questions are resolved by translating both strands in all three reading frames and looking for the one that gives the longest amino acid sequence before a stop codon is encountered. Since there are 64 codons and three of these codons (UAA, UAG and UGA) do not code for any amino acid (i.e. are stop signals), you expect a stop codon to appear on average about once every 20 codons (64/3 is roughly 21) if you are reading a sequence in the incorrect frame. However, things are not always that clear cut, and it is possible for an out-of-frame translation to extend to over 100 amino acids before a stop codon is reached. In the exercise below you will be given an unknown DNA sequence and asked to use a web tool to translate the sequence into an amino acid sequence and hopefully identify the proper reading frame. You will then save this amino acid sequence to a word processing program for use in the next exercise.

Requirement: The sequence might be obtained by sequencing a clone from a cDNA library or by isolating an amplified DNA fragment from PCR amplification. Alternatively, you can retrieve a sequence from a nucleic acid sequence database, as studied earlier.


Procedure:

1. To browse the World Wide Web, open your internet browser (Internet Explorer, Google Chrome or Mozilla Firefox).

2. In the address bar, type the URL http://web.expasy.org/translate/ and press the 'Enter' key on your keyboard. A new window will open to access the translation tool. Translating the DNA sequence is done by reading the nucleotide sequence three bases at a time and then looking at a table of the genetic code to arrive at an amino acid sequence. This program examines the input sequence in all six possible frames (i.e. reading the given strand and its complementary strand, each starting with the nucleotide at position 1, 2 and 3 separately). What you typically look for in identifying the proper translation is the frame that gives the longest amino acid sequence before a stop codon is encountered. Since there are 64 codons and three code for nonsense, you expect a stop codon to appear on average about once every 20 codons if you simply read a sequence "out of frame". However, "on average" is just that, and it is possible to have an incorrect reading frame give an extended sequence with no stop codons. The next exercise will address that problem.

3. Type or paste your sequence in the sequence window of the ExPASy translation page.

4. Under Output format select either 'Compact' or 'Verbose'. 'Compact' gives the amino acid sequence as one-letter codes with stop codons indicated by a hyphen, whereas 'Verbose' gives the amino acid sequence as three-letter codes.

5. Click on Translate Sequence. Often only one reading frame will give you a translation with no stop codons, but this is not always the case. If you get multiple possible reading frames, one way to determine which is most likely the true frame is to use the BLAST program to determine if the sequence corresponds to any known protein sequence. Using the "Compact" output to get one-letter sequences, copy the one-letter sequence of the best reading frame (i.e. one with no stop codons) and paste it into the window below labeled "Best Guess".

6. Copy the longest amino acid sequence (i.e. no hyphen) of one of the other reading frames to the window below labeled "Second Best". If you have two reading frames without a stop codon, simply copy each to the boxes below.

7. Copy and save each sequence to a word processor for use in the next exercise.

Best reading frame:

Amino acid sequence from next best frame (don't include the stop):

Conclusion: You have now been introduced to the use of a translation program to identify the most probable reading frame and to translate an unknown sequence. What if none of the six possible reading frames gives an extended amino acid sequence? This could be due to errors in your sequence (you need to sequence both strands to ensure an accurate sequence). Or you may have isolated a non-coding region of DNA (e.g. you know that the 5' and 3' ends of most genes do not code for protein, but serve regulatory functions). There are many untranslated regions of DNA (introns, pseudogenes, etc.). You can now take the two amino acid sequences and determine if either matches any known sequence in the huge protein sequence databases.
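The six-frame translation carried out by the web tool in this exercise can be reproduced in a compact Python sketch. It assumes the standard genetic code and marks stop codons with * (the ExPASy compact output uses a hyphen instead); unknown codons are shown as X. The function names and the test sequence are made up for the illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate from the first base, three bases at a time; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return a dict mapping frame (1, 2, 3, -1, -2, -3) to its translation."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # complementary strand, 5' to 3'
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


def longest_open_stretch(protein):
    """Longest run of amino acids not interrupted by a stop codon."""
    return max(protein.split("*"), key=len)


for frame, protein in sorted(six_frame_translation("AAATGGCCTAAGG").items()):
    print(frame, protein, longest_open_stretch(protein))
```

Comparing the longest uninterrupted stretch in each frame is the same "best guess" criterion used in the procedure above.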


Exercise 5: Identifying a gene using the BLAST program
URL: http://blast.ncbi.nlm.nih.gov/Blast.cgi
Theory: Once you have identified a likely reading frame for your DNA sequence, you will want to see if it corresponds to any known protein. Alternatively, if you obtained two reading frames of nearly equal length, you will need to decide which is correct. To accomplish these tasks, you can compare your sequences to all of the known protein sequences in the databases using a search tool known as BLAST. BLAST comes in a variety of forms depending on whether your query is a DNA sequence or an amino acid sequence and on whether you are searching nucleotide or protein databases. You are going to do this exercise twice. First, you will take the longest open reading frame and use it as a query sequence with BLASTP. After saving those results, you will take the next longest amino acid sequence and use it as your query sequence.

Procedure:
1. To browse the World Wide Web, open your favourite internet browser (Internet Explorer, Google Chrome, Mozilla Firefox etc.).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press the ‘Enter’ key on your keyboard. The BLAST page at the NCBI appears as shown below.
3. Under the Basic BLAST heading, click the Protein BLAST [blastp] link. A search page will appear as shown below.
4. Paste your longest translated sequence into the first box.
5. Choose UniProtKB/Swiss-Prot from the Choose database pull-down menu.
6. Deselect the Do CD-search box. Scroll down the page to the Format section and use the pull-down menus to change the Descriptions to 10 and the Alignments to 10. Change the Layout to One Window. Leave the Options section settings on the default values.
7. Click the BLAST button at the bottom or top of the screen. A new window will appear that gives an estimate of how long the search will take and lists any conserved domains found in your query sequence. You may want to copy your request ID number, but usually this isn't necessary.
8. After the indicated time has passed, press the Format button. The results of your search will be displayed.

If similarity to any known protein has been found, you will see a colour window (which may or may not print) showing the degree of similarity and the range of similarity. Perfect matches show up as red, next best as purple, mediocre as green, poor matches as blue and very poor or no match as black. If you scroll down you will see the best 10 alignments (make sure you have limited this to 10!). If the DNA sequence has already been identified, it should show up as a perfect match (score generally between 200 and 400, but it could be lower depending on the size of the peptide analysed; the E value will be down around 10^-50 to 10^-100). The E value is the number of hits with a score at least this high that you would expect to find by chance in a database of this size, so the closer it is to zero, the more significant the match. Copy the line below the colour alignment window that shows the sequence producing the best alignment. This will give you the identifiers (gi number and other identifying numbers) you will need to download the full protein from the database for characterization. Save this information.
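The link between score and E value can be made concrete with a short calculation. The sketch below assumes the standard relation between a BLAST bit score S' and the expected number of chance hits, E = m × n × 2^(-S'), where m is the query length and n the total length of the database. The function names and the example lengths are hypothetical, and real BLAST reports use corrected ("effective") lengths, so the figures are only indicative.

```python
import math


def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits: E = m * n * 2**(-bit_score)."""
    return query_length * database_length * 2.0 ** (-bit_score)


def bit_score_needed(target_e, query_length, database_length):
    """Bit score at which the expected number of chance hits equals target_e."""
    return math.log2(query_length * database_length / target_e)


# Hypothetical search: a 300-residue query against 200 million residues.
example_e = expect_value(200, 300, 2e8)
```

With these assumed lengths, a bit score of 200 gives an E value of the order of 10^-50, which agrees with the range quoted above for a sequence that is already in the database. Each additional bit of score halves the E value.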


Exercise 6: Finding Domains in Protein Sequences
Theory: Many proteins which have been classified as "globular" (i.e. folded into a compact globular shape) appear to be composed of several distinct folded regions joined by more extended loops of amino acids. These globular sub-regions are termed "domains" and can range in size from 20 to 300 amino acids. Some domains have been associated with specific functions (e.g. catalysis of peptide bond cleavage, ATP binding, etc.), but this association must be tentative, since ligand binding or formation of an active site often takes place at the surface where two domains interact. Identification of domains can help us to assign a newly discovered open reading frame to a family of proteins. Domains in a newly discovered protein can be recognized by sequence homology with known domains in well characterized proteins, but this is still not a precise science. While new techniques of analysis are being introduced, at present the most user-friendly and visual domain identification program is the SMART domain annotation database.

Procedure:
1. To browse the World Wide Web, open your favourite internet browser (Internet Explorer, Google Chrome, Mozilla Firefox etc.).
2. In the address bar, type http://smart.embl-heidelberg.de/smart/set_mode.cgi?NORMAL=1 and press the ‘Enter’ key on your keyboard. The requested page appears as shown below.
3. Copy the full sequence of the protein identified in the previous exercise and paste it into the SMART sequence window.
4. Click the Sequence SMART button. Depending on how busy the SMART server is, it may take a few minutes for a result to be returned. Be patient! The results will show you a live diagram of the domains within the query sequence. Each domain has a unique colour, shape and annotation. Scroll down the window to see a table that lists each identified domain together with its putative (probable) start and end point in your sequence and the expectation value (E-value) assigned to that identification (the smaller the E-value, the less likely it is that the identification is simply due to chance).
5. Click on a domain in the figure or in the table. This brings up the domain name or abbreviation and the amino acid sequence assigned to this domain at the very bottom of the window. On a PC, right-click on the image to save it as a PNG file. It can be opened in Photoshop or most other image viewers.
6. Click on the domain name. This brings up more detailed information on the domain. Pick out one domain to examine in detail. What are the characteristics (amino acid sequences) that define that domain? What kinds of proteins contain this domain? What is the function of that domain? How similar is your sequence to the defined domain?


Exercise 7: Nucleotide BLAST (BLASTn)
URL: http://blast.ncbi.nlm.nih.gov/Blast.cgi
Theory: The BLAST (Basic Local Alignment Search Tool) programs have been designed for speed in finding high-scoring local alignments. BLAST uses a heuristic algorithm which seeks local as opposed to global alignments and is therefore able to detect relationships among sequences which share only isolated regions of similarity. BlastN is a pairwise sequence comparison tool developed by NCBI; it compares a nucleotide query sequence against the NCBI nucleotide sequence databases. Compared with the faster megablast option, it is better at finding sequences that are similar, but not identical, to your query.
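The idea of a local alignment can be illustrated with a small scoring routine. The sketch below uses the exact Smith-Waterman dynamic programming method, not the much faster heuristic that BLAST itself uses, and the scoring values (match 2, mismatch -1, gap -2) and the function name are arbitrary choices made for this example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(score)
            best = max(best, score)
        previous = current
    return best
```

Two sequences that share only a short common stretch, such as "TTTTACGTACGTTTTT" and "GGGGACGTACGGGGG", still obtain a clearly positive local score from the shared region, whereas an end-to-end (global) comparison would be dominated by the mismatching flanks.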

Procedure:
1. To browse the World Wide Web, open your favourite internet browser (Internet Explorer, Google Chrome, Mozilla Firefox etc.).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press the ‘Enter’ key on your keyboard. The BLAST page at the NCBI appears as shown below.
3. Under the Basic BLAST heading, click the Nucleotide BLAST [blastn] link. A search page will appear as shown below.
4. Paste your nucleotide sequence into the first box.
5. Choose the nr/nt (nucleotide collection) database from the Choose database pull-down menu. Then select any other options you require; if you leave them unchanged, the programme will use the default settings.
6. Deselect the Do CD-search box if it is shown (this option applies to protein queries). Scroll down the page to the Format section and use the pull-down menus to change the Descriptions to 10 and the Alignments to 10. Change the Layout to One Window. Leave the Options section settings on the default values; these choices are addressed in a more advanced exercise.
7. Click the BLAST button at the bottom or top of the screen.
8. Then click the BlastN option at the end of the submission page.

After a few seconds the result of your BLAST search will appear in a new window. The first part shows a Graphic View of the matches, followed by a list of the matches and then the individual alignments. The result page displays a number of hits. From the large number of sequences returned, choose the hits with the lowest E-values: the lower the E-value, the more similar the hit is to the query.
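Choosing hits by lowest E-value can also be done on a saved hit table. The sketch below assumes the hits are available as tab-separated text with the twelve default columns of the BLAST+ tabular format (-outfmt 6). The sequence identifiers and values in EXAMPLE_TABLE are invented for illustration, and the function names and the 1e-5 cut-off are arbitrary choices, not NCBI recommendations.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# Invented example rows; real tables come from a BLAST run.
EXAMPLE_TABLE = "\n".join([
    "query1\tsubjB\t85.00\t180\t27\t0\t5\t184\t20\t199\t3e-60\t230",
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t11\t210\t2e-105\t380",
    "query1\tsubjC\t40.00\t50\t30\t0\t60\t109\t300\t349\t0.5\t30.1",
])


def parse_hits(text):
    """Turn tab-separated hit lines into dictionaries with numeric scores."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        for field in ("pident", "evalue", "bitscore"):
            hit[field] = float(hit[field])
        hits.append(hit)
    return hits


def best_hits(hits, max_evalue=1e-5, limit=10):
    """Keep hits at or below the E-value cut-off, lowest E-value first."""
    kept = [hit for hit in hits if hit["evalue"] <= max_evalue]
    return sorted(kept, key=lambda hit: (hit["evalue"], -hit["bitscore"]))[:limit]
```

Calling best_hits(parse_hits(EXAMPLE_TABLE)) puts subjA first and drops subjC, whose E-value of 0.5 means a match that good is quite likely to occur by chance.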


Exercise 8: Protein BLAST (BlastP)
URL: http://blast.ncbi.nlm.nih.gov/Blast.cgi
Theory: BlastP is a pairwise sequence comparison tool developed by NCBI. It uses the BLAST algorithm to compare an amino acid query sequence against the NCBI protein sequence databases. Similarity to well characterized proteins helps you to infer the likely structure and function of the query protein.

Procedure:
1. To browse the World Wide Web, open your favourite internet browser (Internet Explorer, Google Chrome, Mozilla Firefox etc.).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press the ‘Enter’ key on your keyboard. The BLAST page at the NCBI appears as shown below.
3. Under the Basic BLAST heading, click the Protein BLAST [blastp] link. A search page will appear as shown below.
4. Paste the amino acid sequence of your protein, or your longest translated sequence, into the first box.
5. Choose UniProtKB/Swiss-Prot from the Choose database pull-down menu. Then select any other options you require; if you leave them unchanged, the programme will use the default settings.
6. Deselect the Do CD-search box. Scroll down the page to the Format section and use the pull-down menus to change the Descriptions to 10 and the Alignments to 10. Change the Layout to One Window. Leave the Options section settings on the default values; these choices are addressed in a more advanced exercise.
7. Click the BLAST button at the bottom or top of the screen.

After a few seconds the result of your BLAST search will appear in a new window. The first part shows a Graphic View of the matches, followed by a list of the matches and then the individual alignments. A number of hits are displayed. From the large number of sequences returned, choose the hits with the lowest E-values: the lower the E-value, the more similar the hit is to the query.


Exercise 9: Translated BLAST (Blastx)
URL: http://blast.ncbi.nlm.nih.gov/Blast.cgi
Theory: Blastx searches a protein database using a translated nucleotide query. It uses the BLAST algorithm to compare the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database. The BLAST (Basic Local Alignment Search Tool) programs have been designed for speed in finding high-scoring local alignments. BLAST uses a heuristic algorithm which seeks local as opposed to global alignments and is therefore able to detect relationships among sequences which share only isolated regions of similarity.

Procedure:
1. To browse the World Wide Web, open your favourite internet browser (Internet Explorer, Google Chrome, Mozilla Firefox etc.).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press the ‘Enter’ key on your keyboard. The BLAST page at the NCBI appears as shown below.
3. Under the Basic BLAST heading, click the blastx link (search protein database using a translated nucleotide query). A new page appears; this is the sequence submission page.
4. Enter the nucleotide sequence into the Search dialog box.
5. Use the default settings to search the Non-redundant protein sequences (nr) database. Then select any other options you require; if you leave them unchanged, the programme will use the default settings. Select the search and format options that you want for your data output. For some queries you may get hundreds of hits, so you should limit the number on the first search. Recheck that all the information is correct.
6. To submit the request, click the BLAST button at the bottom or top of the screen.

After a few seconds the result of the blastx search will appear in a new window and a number of hits will be displayed. The blastx report is very similar to the blastn report. The first part shows a Graphic View of the matches, followed by a list of the matches and then the individual alignments. The BLASTX search with the same sequence shows a significant number of very good matches. From the large number of sequences returned, choose the hits with the lowest E-values.


Exercise 10: tBLASTX
URL: http://www.ncbi.nlm.nih.Gov/BLAST
Theory: tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database using the BLAST algorithm.

Procedure:
1. To browse the World Wide Web, open your favourite internet browser (Internet Explorer, Google Chrome, Mozilla Firefox etc.).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press the ‘Enter’ key on your keyboard. The BLAST page at the NCBI appears as shown below.
3. Under the Basic BLAST heading, click the [tblastx] link, which searches a translated nucleotide database using a translated nucleotide query. A new page appears; this is the sequence submission page.
4. Enter the nucleotide sequence into the Search dialog box.
5. Use the default settings to search the nucleotide collection (nr/nt) database, since tblastx searches a nucleotide database. Then select any other options you require; if you leave them unchanged, the programme will use the default settings. Select the search and format options that you want for your data output. For some queries you may get hundreds of hits, so you should limit the number on the first search. Recheck that all the information is correct.
6. To submit the request, click the BLAST button at the bottom or top of the submission page.

After a few seconds the result of the search will appear in a new window. From the large number of sequences returned, choose the hits with the lowest E-values: the lower the E-value, the more similar the hit is to the query.


Exercise 11: PSI-BLAST (Position-Specific Iterated BLAST)
URL: http://www.ncbi.nlm.nih.Gov/BLAST
Theory: Position-Specific Iterated BLAST (PSI-BLAST) was introduced in 1997. PSI-BLAST is an extension of BLAST in which position-specific scoring is used. This means that, after an initial search, the matches found within a certain significance threshold are combined into a "profile" that describes the family of sequences you are looking for. The profile is then used in place of the original query to refine the search, and the procedure is repeated. This allows more distantly related sequences to be found. The profiles are represented as position-specific scoring matrices (PSSMs), which score each residue differently at each position, rather than as a single general substitution matrix.
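A position-specific scoring matrix can be illustrated with a toy calculation. The sketch below builds log-odds scores from the columns of a small, already aligned set of sequences. It is a simplification: PSI-BLAST itself weights the sequences and uses more elaborate pseudocounts and background frequencies, whereas here every residue is assumed equally likely in the background and a flat pseudocount of 1 is used. The function names and the three example sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACID_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet=AMINO_ACID_ALPHABET, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned if seq[col] in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        column = {}
        for residue in alphabet:
            frequency = (counts[residue] + pseudocount) / total
            column[residue] = math.log2(frequency / background)
        pssm.append(column)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of a segment as long as the profile."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


profile = build_pssm(["MKV", "MRV", "MKI"])
```

In this toy profile, M scores highly at the first position only, so the segment "MKV" scores higher than "MRV", and both score higher than an unrelated segment such as "AAA". This is how a profile rewards residues that are conserved at a particular position of the family.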

Procedure:
1. To browse the World Wide Web, open your favourite internet browser (Internet Explorer, Google Chrome, Mozilla Firefox etc.).
2. In the address bar, type http://blast.ncbi.nlm.nih.gov/Blast.cgi and press the ‘Enter’ key on your keyboard. The BLAST page at the NCBI appears as shown below.
3. Under the Basic BLAST heading, click the Protein BLAST [blastp] link. A search page will appear as shown below.
4. Under the Program Selection heading, click the PSI-BLAST (Position-Specific Iterated BLAST) button.
5. Paste your protein sequence into the search window, or simply enter the GI number of the protein.
6. Choose UniProtKB/Swiss-Prot from the Choose database pull-down menu. Then select any other options you require; if you leave them unchanged, the programme will use the default settings. Enter the threshold value that determines how divergent the proteins you are interested in finding may be. The rest of the parameters are generally left at their default settings.
7. Click the BLAST button to start the first round of the PSI-BLAST search. The search can take longer than the time shown on screen. Be patient. An intermediate page (entitled Reformatting Blast) appears, containing a ‘Format’ button.
8. Click this Format button. A new page, entitled Results of Blast, appears in a new window. This is where your results will be displayed when ready.
9. Inspect the results. There are usually many very similar sequences and only a few distantly related ones.
10. Click the Run PSI-BLAST iteration 2 button (near the top of the page). The Reformatting Blast window pops up.
11. Click the Format button in the Reformatting Blast window. The results will appear in the Results of Blast window.
12. Continue repeating steps 10 and 11 until the search converges, that is, until no new sequences are found.

The PSI-BLAST output consists of many iterations. Each iteration has a hit list, the alignments and the parameters used for the analysis. Each iteration page contains an iteration button that starts the next iteration.

Conclusion: PSI-BLAST is one of the most widely used protein similarity search programs in the BLAST family. It offers exciting opportunities to discover new types of relationship in protein databases and can be used to infer the evolutionary origins of proteins. PSI-BLAST is a highly sensitive homology search program.


Exercise 12: Sequence alignment through FASTA
URL: http://www.ebi.ac.uk/Tools/sss/fasta/

Theory: This tool compares a protein sequence to a protein sequence database using the FASTA algorithm (Pearson and Lipman, 1988; Pearson, 1996). It provides sequence similarity searching against protein databases using the FASTA suite of programs. FASTA provides a heuristic search with a protein query. FASTX and FASTY translate a DNA query. Optimal searches are available with SSEARCH (local), GGSEARCH (global) and GLSEARCH (global query, local database). Search speed and selectivity are controlled with the ktup (word size) parameter. For protein comparisons, ktup = 2 by default; ktup = 1 is more sensitive but slower.

Procedure:
1. To browse the World Wide Web, open your favourite internet browser (Internet Explorer, Google Chrome, Mozilla Firefox etc.).
2. In the address bar, type http://www.ebi.ac.uk/Tools/sss/fasta/ and press the ‘Enter’ key on your keyboard. The FASTA homepage will appear, showing options such as program, database, result, search title, your email, matrix, gap extension, ktup, expectation value limits, DNA strand, histogram, molecule type, scores, alignments, sequence range, database range, filter and statistical estimates.
3. Under the Basic Program heading, click the Protein link.
4. Select the database from the database pull-down menu.
5. Paste your sequence or upload the file containing the sequence.
6. Set your parameters:
• Matrix: sets the scoring matrix used for searching the database.
• Gap penalties: there are two options, gap opening and gap extension. The default gap opening penalty is -12 for proteins and -16 for DNA. The default gap extension penalty is -2 for proteins and -4 for DNA.
• Score: gives the maximum number of scores reported in the output file.
• K-tup: change this value to set the word length the search should use.
• Strand: lets you choose which strand of the query to search with against the database.
• Histogram: setting this option to ‘yes’ displays the search histogram, which shows the expected frequency of chance occurrence of the database matches found.
• Expectation value upper limit and lower limit: these limit which scores and alignments are displayed. The default upper limit is 10.0 for protein searches.
• Sequence range: allows the user to specify which region within the query sequence should be searched.
• Database range: sets the sequence range to search within the databases.
• Moltype: the moltype option is used to choose the molecule type of the query used for a search.
• Filter: this option can eliminate statistically significant but biologically uninteresting reports from the first FASTA search.
• Statistical estimates: selects the method used for the statistical calculations.
Select any other options you require; if you leave them unchanged, the programme will use the default settings.
7. Click the Submit button. For some proteins you may get hundreds of hits, so you should limit the number on the first search. Recheck that all the information is correct. A histogram will be returned along with the alignments.
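The effect of the ktup (word size) parameter can be illustrated with a simplified word-matching step. The sketch below only covers the first stage of a FASTA-style search: it indexes the words of length ktup in the query and counts, for each diagonal, how many identical words the two sequences share. The real program then rescores and joins the best diagonals, which is not shown. The function names and the example sequences are hypothetical.

```python
from collections import Counter, defaultdict


def index_words(sequence, ktup):
    """Map every word of length ktup to the positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(sequence) - ktup + 1):
        positions[sequence[i:i + ktup]].append(i)
    return positions


def diagonal_hits(query, subject, ktup=2):
    """Count identical ktup-words on each diagonal (query pos - subject pos)."""
    index = index_words(query, ktup)
    diagonals = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            diagonals[i - j] += 1
    return diagonals


QUERY = "MKVLAAGIT"
SUBJECT = "GGMKVLAAGITGG"
hits_ktup2 = diagonal_hits(QUERY, SUBJECT, ktup=2)
hits_ktup1 = diagonal_hits(QUERY, SUBJECT, ktup=1)
```

With ktup = 2 all eight shared words fall on a single diagonal, which makes the matching region easy to find. With ktup = 1 every single identical residue counts as a hit, so many more hits on other diagonals must be examined. This is why the smaller word size is more sensitive but slower.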


Exercise 13: Editing and analyzing multiple sequence alignment using Jalview
URL: http://www.jalview.org/
Theory: Jalview is a free bioinformatics program, written in the Java programming language, for editing, visualizing and analysing multiple sequence alignments (MSAs). It has a wide range of functions: it is used to view and edit sequence alignments, to analyse them with phylogenetic trees and principal components analysis (PCA) plots, and to explore molecular structures and annotation.

Procedure:
1. To browse the World Wide Web, open your favourite internet browser (Internet Explorer, Google Chrome, Mozilla Firefox etc.).
2. In the address bar, type http://www.jalview.org/ and press the ‘Enter’ key on your keyboard.
3. Paste the MSA or the unaligned sequences into the sequence window, then click the Run button so that an initial result page appears.
4. The browser returns a page which loads the Java applet into the memory of the computer; on this page the word Jalview appears as a button. Click the Jalview button to obtain the result.

Jalview can also run offline. To do this, load Jalview onto the computer, then select the File menu and click the Work Offline option in the internet browser. In the Jalview window select File and then click Input Alignment via text box. Paste the MSA into the text box and select the format that corresponds to the MSA from the alignment format drop-down menu. Then click the Apply button.

Result:
➢ In the result page of Jalview, edit a group of sequences by using the editing window.
➢ In the pop-up window click the Add New Group button and then the Add Selected IDs button. Then click the Apply button and the Close button to finish.
➢ Choose Edit and then Group Editing Mode from the main menu, then click anywhere on a sequence and drag to the left or right to insert or remove gaps.
➢ Save the alignment produced by Jalview using the following options:
• Choose File and then Output Alignment via text box from the Jalview main menu.
• Select the alignment format and click the Apply button; the formatted alignment appears in the window.
• Open a Microsoft Word document. Select, copy and paste the alignment from the Jalview text box into the Word document and save the document.
➢ For publishing the multiple sequence alignment, use the Boxshade utility, which shades the columns according to their level of conservation and produces files that are suitable for publication.

Conclusion: Jalview is an online and offline tool for editing and analysing an MSA. It produces a well presented alignment which can then be used for publication.


Exercise 14: Making multiple alignment with T-Coffee
URL: http://www.ch.embnet.org/software/Tcoffee.html
Theory: T-Coffee (Tree-based Consistency Objective Function for alignment Evaluation) is multiple sequence alignment software that uses a progressive approach. It generates a library of pairwise alignments to guide the multiple sequence alignment. T-Coffee has two main features. First, it provides a simple and flexible means of generating multiple sequence alignments using heterogeneous data sources. The data from these sources are provided to T-Coffee via a library of pairwise alignments. The second main feature of T-Coffee is the optimization method, which is used to find the multiple alignment that best fits the pairwise alignments in the input library. It uses a so-called progressive strategy, similar to that used in ClustalW, which has the advantage of being fast and relatively robust. Unlike a purely progressive method, however, T-Coffee is able to consider information from all of the sequences during each alignment step, not just those being aligned at that stage.
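How information from all of the sequences can enter each pairwise comparison may be sketched with a toy version of the library step. In the sketch below the primary library holds a weight for each pair of residues aligned in some pairwise alignment, and the extended weight of a pair is its own weight plus, for every third residue aligned to both, the smaller of the two connecting weights. This is a simplified reading of the T-Coffee library extension, not the program's actual code, and the residue labels, weights and function name are invented for illustration.

```python
from collections import defaultdict


def extend_library(primary):
    """primary maps (residue_x, residue_y) to a weight.

    A residue is written as (sequence_name, position). The result maps a
    frozenset of two residues to the extended weight of that pair.
    """
    neighbours = defaultdict(dict)
    extended = defaultdict(float)
    for (x, y), weight in primary.items():
        neighbours[x][y] = weight
        neighbours[y][x] = weight
        extended[frozenset((x, y))] += weight  # direct evidence
    for linked in neighbours.values():
        partners = sorted(linked.items())
        for i, (x, weight_x) in enumerate(partners):
            for y, weight_y in partners[i + 1:]:
                if x[0] != y[0]:  # only pair residues from different sequences
                    # Indirect evidence through the shared third residue.
                    extended[frozenset((x, y))] += min(weight_x, weight_y)
    return dict(extended)


# Invented primary library for three sequences named A, B and C.
PRIMARY = {
    (("A", 0), ("B", 0)): 88,
    (("A", 0), ("C", 0)): 77,
    (("B", 0), ("C", 0)): 100,
    (("A", 1), ("C", 1)): 50,
    (("C", 1), ("B", 2)): 60,
}
EXTENDED = extend_library(PRIMARY)
```

In this example the pair A0 and B0 receives extra support through C0, and the pair A1 and B2, which was never aligned directly, acquires a weight through C1. Pairs that are consistent with the other pairwise alignments are therefore favoured when the progressive alignment is built.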

Procedure:
1. Open any internet browser, such as Internet Explorer or Google Chrome.
2. In the address bar, type the NCBI address (http://www.ncbi.nlm.nih.gov) and press the ‘Enter’ key. The NCBI home page will appear.
3. Search for any two or more nucleotide sequences, display them in FASTA format and copy them into a Microsoft Word page.
4. Open a new browser tab and search for T-Coffee.
5. Point the browser to the T-Coffee server homepage.
6. In the table, move the mouse over Make a multiple alignment and click Regular. The multiple alignment page appears.
7. Enter your e-mail address on that page, so that if the job times out the result can be returned by e-mail.
8. Paste the sequences into the box used for alignment.
9. Click the Submit button at the top or the bottom of the page. The T-Coffee alignment result will then appear.

Result: T-Coffee returns a table that contains hyperlinks to the results. The first row of the table is dedicated to the multiple sequence alignment and includes:
➢ Aln: a text file in the same format as ClustalW alignments.
➢ HTML: a colourised alignment in which every residue appears on a background that indicates the quality of its alignment. Red indicates high-quality segments, while blue indicates regions that cannot be trusted.
➢ Pdf: the same alignment as a PDF file, which is easy to display and print.
The second row is dedicated to the phylogenetic tree and includes:
➢ Dnd: the guide tree or dendrogram generated by T-Coffee, in Newick format.
➢ Ph: a real phylogenetic tree in Newick format, built using the neighbor-joining method.
➢ Png: a picture of the phylogenetic tree that corresponds to the Ph file.

Advantages
➢ It generally produces more accurate alignments than purely progressive methods such as ClustalW.
➢ It is equipped with many different tools and modules, such as CORE, M-Coffee and EXPRESSO, for structure alignment, evaluation and combining alignments.
➢ T-Coffee can deal with many input formats, including FASTA, Swiss-Prot and PIR (Protein Information Resource).
➢ T-Coffee produces sequence alignments in various formats, so that they can be used as input for another program. It also produces a colourised alignment, in (.html) and (.pdf) format, in which every residue appears on a background that indicates the quality of its alignment.
➢ It can produce a true phylogenetic tree in Newick format by using the neighbor-joining method.
➢ It can work with lists of DNA, RNA or protein sequences.
➢ T-Coffee can evaluate the quality of any multiple sequence alignment using the CORE server.

Disadvantages
➢ It takes longer to align multiple sequences than other programs.
➢ It has been cited less often in peer-reviewed journals than ClustalW.

Conclusion: T-Coffee is a progressive alignment method and in many ways it is similar to ClustalW. The main difference is that T-Coffee does not directly use substitution matrices to align sequences. Instead, it relies on the pairwise alignments produced by other methods and combines them.


Exercise 15: Performing Online Mendelian Inheritance in Man (OMIM)
URL: http://www.ncbi.nlm.nih.gov/omim / www.omim.org
Theory: OMIM is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. OMIM is authored and edited at the McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, under the direction of Dr. Ada Hamosh. Its official home is omim.org (according to NCBI).


Procedure:
➢ Open any internet browser (Internet Explorer, Google Chrome or Mozilla Firefox).
➢ In the address box type www.google.com.
➢ In the search box type OMIM and press Enter or click Search.
➢ Different websites with short explanations will appear on a new page.
➢ Study the listed websites and click any one of them, i.e. http://www.ncbi.nlm.nih.gov/omim or www.omim.org, until you get the information you require.
➢ When you type www.omim.org you get the OMIM home page.
➢ In the search box of that page type the name of any human gene, for example insulin, then click on Search.
➢ On the new page you will get entries covering different aspects of that human gene.
➢ From these, click on the desired entry.
➢ Suppose you click on #610549 Diabetes Mellitus, Insulin-Resistant, with Acanthosis Nigricans.


Exercise 16: Studying about Protein Structure Database
URL: http://www.rcsb.org/pdb/home/home.do
http://scop.mrc-lmb.cam.ac.uk/scop/
http://www.cathdb.info/
Theory: The Protein Data Bank (PDB) is a repository for the three-dimensional structural data of large biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray crystallography or NMR spectroscopy and submitted by biologists and biochemists from around the world, are freely accessible on the Internet via the websites of its member organizations (PDBe, PDBj, and RCSB). The PDB is overseen by an organization called the Worldwide Protein Data Bank (wwPDB).

Procedure:
1. Open any internet browser (Internet Explorer, Google Chrome, Mozilla Firefox etc.).
2. In the address bar type www.google.com.
3. In the search bar type PDB, SCOP and CATH.
4. Press Enter or click the search button.
5. Different websites with short explanations will appear on a new page.
6. Study the listed websites and open any one of them until you get the information you require, then write it down.


Exercise 17: Depositing sequences in database
URL: BankIt [http://www.ncbi.nlm.nih.gov/BankIt/], Sequin [http://www.ncbi.nlm.nih.gov/Sequin/index.html]

The GenBank sequence database is an annotated collection of all publicly available nucleotide sequences and their protein translations. This database is produced at the National Center for Biotechnology Information (NCBI) as part of an international collaboration with the European Molecular Biology Laboratory (EMBL) Data Library from the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ). GenBank and its collaborators receive sequences produced in laboratories throughout the world from more than 100,000 distinct organisms. GenBank continues to grow at an exponential rate, doubling every 10 months.

Direct submissions are made to GenBank using either BankIt [http://www.ncbi.nlm.nih.gov/BankIt/], which is a Web-based form, or the stand-alone submission program Sequin [http://www.ncbi.nlm.nih.gov/Sequin/index.html]. Upon receipt of a sequence submission, the GenBank staff assign an accession number to the sequence and perform quality assurance checks. The submissions are then released to the public database, where the entries are retrievable by Entrez or downloadable by FTP. Bulk submissions of Expressed Sequence Tag (EST), Sequence Tagged Site (STS), Genome Survey Sequence (GSS) and High-Throughput Genome Sequence (HTGS) data are most often made by large-scale sequencing centres. The GenBank direct submissions group also processes complete microbial genome sequences.

Submission Tool: Direct submissions to GenBank are prepared using one of two submission tools, BankIt or Sequin.



Exercise 18: Submitting sequences to GenBank through ‘BankIt’

URL: [http://www.ncbi.nlm.nih.gov/BankIt/]

Theory: BankIt is a Web-based form that is a convenient and easy way to submit a small number of sequences with minimal annotation to GenBank. To complete the form, a user is prompted to enter submitter information, the nucleotide sequence, biological source information, and features and annotation pertinent to the submission. BankIt has extensive Help [http://www.ncbi.nlm.nih.gov/BankIt/help.html] documentation to guide the submitter. Included with the Help document is a set of annotation examples that detail the types of information required for each type of submission.

After the information is entered into the form, BankIt transforms it into a GenBank flat file for review. In addition, a number of quality assurance and validation checks ensure that the sequence submitted to GenBank is of the highest quality. The submitter is asked to include spans (sequence coordinates) for the coding regions and other features, and to include the amino acid sequence of the proteins that derive from these coding regions. The BankIt validator compares the amino acid sequence provided by the submitter with the conceptual translation of the coding region based on the provided spans. If there is a discrepancy, the submitter is requested to fix the problem, and the process is halted until the error is resolved.

To prevent the deposit of sequences that contain cloning vector sequence, a BLAST similarity search is performed on the sequence, comparing it to the VecScreen [http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html] database. If there is a match to this database, the user is asked to remove the contaminating vector sequence from the submission or to explain why the screen was positive. Completed forms are saved in ASN.1 format, and the entry is submitted to the GenBank processing queue. The submitter receives confirmation by email, indicating that the submission process was successful.
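The comparison between the submitted protein and the conceptual translation of the coding region can be illustrated with a short Python sketch. This is not BankIt's actual validator; it only assumes the standard genetic code, a 1-based inclusive span on the forward strand, and a hypothetical toy sequence.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_cds(sequence, start, stop):
    """Conceptually translate the span start..stop (1-based, inclusive)."""
    cds = sequence[start - 1:stop].upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS span length is not a multiple of three")
    # Codons containing ambiguous bases are translated as X.
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )
    return protein.rstrip("*")  # drop any trailing stop codon symbol


def check_submission(sequence, start, stop, submitted_protein):
    """Return (matches, translation) for a submitted protein sequence."""
    translation = translate_cds(sequence, start, stop)
    return translation == submitted_protein.upper(), translation


# Hypothetical example: the CDS occupies positions 3 to 14 (ATG GCC AAA TAG).
toy_sequence = "GGATGGCCAAATAGCC"
print(check_submission(toy_sequence, 3, 14, "MAK"))
print(check_submission(toy_sequence, 3, 14, "MAR"))
```

A mismatch such as the second call corresponds to the situation in which the submitter would be asked to correct either the spans or the protein sequence.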



Requirements for GenBank Submissions through ‘BankIt’

Contact Information
Name, address, phone number, fax number and email address of the submitter must be entered when registering and submitting for the first time. Subsequent BankIt submissions will retain this information and display it once the submitter logs in.

Release date information
• Immediately after it is processed at NCBI, OR
• On a date the submitter specifies

Reference information
• Sequence authors: names of the researchers who are credited with the sequence
• Publication information: Unpublished, In-Press, or Published; and applicable citation information (paper's title, authors, journal title, volume, issue, year, pages)

Submission Category and Type
• Original sequencing or Third Party Annotation
• Single sequence, sequence set (phylogenetic, population, environmental, etc.), or batch

Nucleotide sequence(s)
• Input (cut-and-paste) single or multiple sequences, OR
• Upload them as a FASTA file; FASTA files should include organisms in their definition lines
• Sequences must be at least 200 nucleotides long (unless they are complete exons, non-coding RNAs (ncRNAs), microsatellites or ancient DNA)
• Molecule type: what was sequenced? (genomic DNA, mRNA, genomic RNA, cRNA, etc.)
• Topology: linear or circular (circular must be complete, such as a complete plasmid)


Organism name, applicable source modifiers, location
• Genus and species names (if not previously provided in FASTA file)
• If name is new or unrecognized, provide best known taxonomic lineage
• If genus and/or species names are not known, provide most specific name known (for example: Bacillus sp., Uncultured bacterium, Uncultured archaeon)
• Most complete name for any synthetic vector (for example: Cloning vector pAB234, Transfer vector p789Abc)
• Source modifiers include: strain, clone, isolate, specimen-voucher, isolation-source, country
• Location: organelle (mitochondrion, chloroplast, etc.); map and/or chromosome

Features of the sequence
Upload files or use input forms to add all applicable features (for example: CDS, gene, rRNA, tRNA, microsatellite, exon, intron)
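Two of the requirements above (an organism in each FASTA definition line and a minimum length of 200 nucleotides) can be checked before upload. The following Python sketch is only an illustration: it assumes the organism is given as a bracketed [organism=...] tag in the definition line, and the sequence names are hypothetical. Short sequences are only flagged, since complete exons, ncRNAs, microsatellites and ancient DNA are exempt from the length rule.

```python
import re

MIN_LENGTH = 200  # minimum length stated in the requirements above


def parse_fasta(text):
    """Yield (definition_line, sequence) pairs from FASTA-formatted text."""
    defline, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line[1:], []
        elif defline is not None:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def check_record(defline, sequence):
    """Return a list of problems for one record (an empty list means none)."""
    problems = []
    # Assumed convention: organism supplied as a bracketed tag in the defline.
    if not re.search(r"\[organism=[^\]]+\]", defline):
        problems.append("no [organism=...] tag in definition line")
    if len(sequence) < MIN_LENGTH:
        problems.append(f"sequence is {len(sequence)} nt, shorter than {MIN_LENGTH}")
    # Accept the four bases, U, and the IUPAC ambiguity codes.
    invalid = set(sequence.upper()) - set("ACGTURYSWKMBDHVN")
    if invalid:
        problems.append("unexpected characters: " + "".join(sorted(invalid)))
    return problems


# Hypothetical input with one acceptable and one problematic record.
example = ">seq1 [organism=Oryza sativa]\n" + "ACGT" * 50 + "\n>seq2\nACGTNX\n"
for name, seq in parse_fasta(example):
    print(name, check_record(name, seq))
```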


Exercise 19: Submitting sequences to GenBank through ‘Sequin’

URL: http://www.ncbi.nlm.nih.gov/Sequin/index.html

Theory: Sequin is more appropriate for complicated submissions containing a significant amount of annotation or many sequences. It is a stand-alone application available on NCBI's FTP [ftp://ftp.ncbi.nih.gov/sequin/] site. Sequin creates submissions from nucleotide and amino acid sequences in FASTA format with tagged biological source information in the FASTA definition line. As in BankIt, Sequin has the ability to predict the spans of coding regions. Alternatively, a submitter can specify the spans of the coding regions in a five-column, tab-delimited table [http://www.ncbi.nlm.nih.gov/Sequin/table.html] and import that table into Sequin.

For submitting multiple, related sequences, e.g., those in a phylogenetic or population study, Sequin accepts the output of many popular multiple sequence-alignment packages, including FASTA+GAP, PHYLIP, MACAW, NEXUS Interleaved, and NEXUS Contiguous. It also allows users to annotate features in a single record or globally across a set of records.
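A five-column, tab-delimited table of the kind mentioned above can be written with a few lines of Python. The sketch assumes the commonly documented layout: a ">Feature" header line naming the sequence, then start, stop and feature key in columns 1 to 3, with each qualifier name and value in columns 4 and 5 on the following lines. The sequence identifier, gene name and product are hypothetical, and the exact layout should be confirmed against the table documentation linked above.

```python
def feature_table(seq_id, features):
    """Build a five-column, tab-delimited feature table as a string.

    features is a list of (start, stop, feature_key, qualifiers) tuples,
    where qualifiers is a list of (name, value) pairs.
    """
    lines = [f">Feature {seq_id}"]
    for start, stop, key, qualifiers in features:
        # Columns 1 to 3: start, stop and feature key.
        lines.append(f"{start}\t{stop}\t{key}")
        # Columns 4 and 5: qualifier name and value, after three empty columns.
        for name, value in qualifiers:
            lines.append(f"\t\t\t{name}\t{value}")
    return "\n".join(lines) + "\n"


# Hypothetical record: one gene with a coding region over the same span.
table = feature_table(
    "Seq1",
    [
        (1, 300, "gene", [("gene", "abcD")]),
        (1, 300, "CDS", [("product", "hypothetical protein")]),
    ],
)
print(table)
```

The resulting text can be saved to a file and imported, which avoids typing the spans into the forms by hand.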

Procedure for Depositing Sequence by Sequin:
• Open any Internet browser, such as Internet Explorer or Google Chrome.
• In the address bar, type the NCBI address (www.ncbi.nlm.nih.gov) and press Enter; the NCBI home page opens.
• Click on Submissions: Submit data to GenBank or other NCBI databases.
• Click on the GenBank option; the GenBank home page opens.
• Click on the Submission Tools option.
• Click on the Sequin tool.
• Under How to Get Sequin, click on Instruction.
• Download the free Sequin software.
• Install it on your PC or laptop.



• Open the software by double-clicking its icon; the Welcome page appears.
• Click on the Start New Submission button.
• Sequin is organized into a series of forms for entering submitting authors, entering organism and sequences, entering information such as strain, gene, and protein names, viewing the complete submission, and editing and annotating the submission.

Figure: Author Submission Form

• The Sequence Format form asks for the type of submission (single sequence, segmented sequence, or population, phylogenetic, or mutation study). For the last three types of submission, which involve comparative studies on related sequences, the format in which the data will be entered can also be indicated. The default is FASTA format (or raw sequence), but various contiguous and interleaved formats (e.g., PHYLIP, NEXUS, PAUP, and FASTA+GAP) are also supported. These latter formats contain alignment information, which is stored in the sequence record.
• The Organism and Sequences form asks for the biological data. On the Organism page, as the user starts to type the scientific name, the list of frequently used organisms scrolls automatically. (Sequin holds information on the top 800 organisms present in GenBank.) Thus, after typing a few letters, the user can fill in the rest of the organism name by clicking on the appropriate item in the list. Sequin now knows the scientific name, common name, GenBank division, taxonomic lineage, and, most importantly, the genetic code to use. (For mitochondrial genes, there is a control to indicate that the alternative genetic code should be used.) For organisms not on the list, it may be necessary to set the genetic code control manually. Sequin uses the standard code as the default. The remainder of the Organism and Sequences form differs depending on the type of submission.

Figure: Organism and Sequences Form

• The goal is to go quickly from raw sequence data to an assembled record that can be viewed, edited, and submitted to your database of choice.
• Advance through the pages that make up each form by clicking on labelled folder tabs or the Next Page button. After the basic information forms have been completed and the sequence data imported, Sequin provides a complete view of your submission, in your choice of text or graphic format.
• At this point, any of the information fields can be easily modified by double-clicking on any area of the record, and additional biological annotations can be entered by selecting from a menu.
• Sequin has an on-screen Help file that is opened automatically when you start the program.
• Because it is context sensitive, the Help text will change and follow your steps as you progress through the program. A "Find" function is also provided.
• Sending the Submission - A finished submission can be saved to disk and e-mailed to one of the databases. It is also good practice to save frequently throughout the Sequin session, to make sure nothing is inadvertently lost. The list at the end of this chapter provides e-mail addresses and contact information for the three databases.


Exercise 20 Primer designing

URL: http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi

Theory: Primers are short, single-stranded oligonucleotides (DNA or RNA) that are synthesized to bind to a complementary strand. A primer binds to its target region and acts as the starting point from which the polymerase extends, and thus determines which segment of DNA gets amplified. DNA consists of a double-stranded helix. One strand of the DNA is named the “sense” strand and the other the “anti-sense” strand; the two strands are complements of each other. During PCR, the denaturing step breaks the hydrogen bonds, separating the two strands. This allows the primers to anneal to the target region on the DNA during the annealing step. One primer is designed to anneal to the sense strand and the other to the anti-sense strand.

When designing primers for PCR, it is necessary to consider: how many primers are needed, the length of the primer, the 5’ and 3’ ends, the location of any mutation in the primer, the primer melting/annealing temperature, the G-C content, “primer dimer” formation, and the distance between the forward and reverse primers.

Length
Primers should be between 15 and 30 bases long, so that they are long enough for adequate specificity and short enough to anneal readily to the DNA template.

The 5’ and 3’ end
The primers need to be designed so that the 3’ end of the forward primer extends toward the reverse primer, and the 3’ end of the reverse primer extends toward the forward primer. In other words, the 3’ ends of the forward and reverse primers should face each other from opposite DNA strands. This allows repeated replication of the desired segment of DNA. If the 3’ ends do not point toward each other, amplification will not work and a PCR product will not be obtained.

Primer Melting Temperature
The primer melting temperature (Tm) is important for the annealing phase of PCR. The Tm should preferably be between 50°C and 65°C, and the Tm values of the forward and reverse primers should differ by no more than 2°C. As a rough rule of thumb, the Tm can be estimated as:
Tm = 4°C × (number of G’s + C’s in the primer) + 2°C × (number of A’s + T’s in the primer)

The G-C content
The G-C content of the primer should be relatively high, as it has a direct relationship with the Tm. The base composition should be about 50%-60% G-C. The 3’ end of the primer should finish with at least one G or C to promote efficient annealing, because G-C pairs bond more strongly.

Distance between the Forward and Reverse
The forward primer and the reverse primer should be between 300 and 2,000 base pairs apart.

Beware of “Primer Dimer”
Primer dimer is an artefact of PCR in which primers bind to each other or to themselves instead of the template DNA, and thus act as their own template to make a small PCR product that appears faintly on an electrophoresis gel. To avoid primer dimers, be sure there are not many complementary areas in the base sequences of your forward and reverse primers where the primer strands would be able to bind to each other instead of the gene.

Things to Avoid
• To avoid non-specific binding, design the primers with high annealing temperatures.
• To make sure the designed primers will bind only to the target area, submit the primer sequences to the BLAST website.



• The MgCl2 and pH conditions can also be adjusted to improve the amplified product.
• Watch out for runs of a single base (G’s, C’s, A’s, or T’s) when developing primers, because they can allow mis-priming.
• Keep in mind that longer primers are more expensive to synthesize, while shorter primers are less specific in PCR.

Resources for General Purpose PCR Primer Design
• Primer3
• Primer3Plus
• PrimerZ
• PerlPrimer
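The design rules given in the theory of this exercise (length, Tm, G-C content, a G or C at the 3’ end, and runs of a single base) can be sketched in Python as follows. This is an illustration only, not the method used by Primer3: the Tm uses the rule-of-thumb formula from the text, the run limit of four bases is an assumed threshold, and the example primer is hypothetical.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def rule_of_thumb_tm(primer):
    """Tm = 4 x (G + C) + 2 x (A + T), in degrees Celsius."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    at = primer.count("A") + primer.count("T")
    return 4 * gc + 2 * at


def longest_run(primer):
    """Length of the longest run of one repeated base."""
    longest = current = 1
    for previous, base in zip(primer, primer[1:]):
        current = current + 1 if base == previous else 1
        longest = max(longest, current)
    return longest


def check_primer(primer, max_run=4):
    """Return a list of rule violations (an empty list means none found)."""
    primer = primer.upper()
    if not primer:
        return ["primer is empty"]
    problems = []
    if not 15 <= len(primer) <= 30:
        problems.append(f"length {len(primer)} is outside 15-30 bases")
    tm = rule_of_thumb_tm(primer)
    if not 50 <= tm <= 65:
        problems.append(f"Tm {tm} is outside 50-65 degrees C")
    gc = gc_percent(primer)
    if not 50 <= gc <= 60:
        problems.append(f"G-C content {gc:.0f}% is outside 50-60%")
    if primer[-1] not in "GC":
        problems.append("3' end does not finish with G or C")
    if longest_run(primer) >= max_run:  # max_run is an assumed threshold
        problems.append(f"contains a run of {max_run} or more identical bases")
    return problems


def check_pair(forward, reverse):
    """Check both primers and the Tm difference between them."""
    problems = check_primer(forward) + check_primer(reverse)
    if abs(rule_of_thumb_tm(forward) - rule_of_thumb_tm(reverse)) > 2:
        problems.append("Tm values differ by more than 2 degrees C")
    return problems


# Hypothetical 20-base primer with 11 G/C and 9 A/T bases.
print(rule_of_thumb_tm("ATGCGTACCTGAGCTTACGC"), check_primer("ATGCGTACCTGAGCTTACGC"))
```

An empty list does not guarantee a working primer; specificity still has to be confirmed, for example with BLAST as recommended above.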

Aim: Primer design on the Web using Primer3 for the STAR-1 gene in rice

URL: http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi

Procedure
• Collect the sequence for which primers are to be designed, in FASTA format, from the NCBI home page.
• Open the source website: http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi
• Paste the sequence in FASTA format into the input box on the home page of the website.
• Keep the default settings and click ‘Pick Primers’ to get the result.
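Once a candidate primer pair has been returned, the distance rule (300 to 2,000 base pairs) and the primer-dimer warning from the theory can be checked against the template sequence. The Python sketch below is a simplified illustration with exact matching only; the four-base window used for the 3’ complementarity check is an assumed threshold, and the template and primers are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def product_size(template, forward, reverse):
    """Amplicon length in bp, or None if either primer has no exact match."""
    template = template.upper()
    fwd_start = template.find(forward.upper())
    if fwd_start == -1:
        return None
    # The reverse primer anneals to the given strand, so its reverse
    # complement is the text that appears in the template as written.
    rev_site = reverse_complement(reverse)
    rev_start = template.find(rev_site, fwd_start + len(forward))
    if rev_start == -1:
        return None
    return rev_start + len(rev_site) - fwd_start


def size_in_range(size, low=300, high=2000):
    """True if the product size lies within the range given in the theory."""
    return size is not None and low <= size <= high


def three_prime_dimer(primer_a, primer_b, n=4):
    """True if the last n bases of primer_a can pair with part of primer_b."""
    tail = primer_a.upper()[-n:]  # n is an assumed window size
    return reverse_complement(tail) in primer_b.upper()


# Hypothetical template: forward site, 400 bp spacer, reverse site.
template = "TT" + "ATGCGTAC" + "A" * 400 + "GGCTTAAG" + "CC"
forward, reverse = "ATGCGTAC", reverse_complement("GGCTTAAG")
size = product_size(template, forward, reverse)
print(size, size_in_range(size), three_prime_dimer(forward, reverse))
```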

