2nd Practical of Basic bioinformatics module

2.1. UniProtKB/Swiss-Prot

The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. In addition to capturing the core data mandatory for each UniProtKB entry (mainly the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added. This includes widely accepted biological ontologies, classifications and cross-references, and clear indications of the quality of annotation in the form of evidence attribution of experimental and computational data.

The UniProt Knowledgebase consists of two sections: a section containing manually annotated records with information extracted from literature and curator-evaluated computational analysis, and a section with computationally analysed records that await full manual annotation. For the sake of continuity and name recognition, the two sections are referred to as "UniProtKB/Swiss-Prot" (reviewed, manually annotated) and "UniProtKB/TrEMBL" (unreviewed, automatically annotated), respectively. Manual annotation consists of a critical review of experimentally proven or computer-predicted data about each protein, including the protein sequences. Data are continuously updated by an expert team of biologists.

We will now see how to search UniProtKB and what information a UniProtKB/Swiss-Prot entry contains.

Step 1. Enter http://www.uniprot.org/ in your browser's address bar, or just click on the link. The UniProt main page shows up (Figure 1). The main page is well organized, and it offers several components that help you find exactly what you need. For now, we will not go into the details of the tools available on the UniProt web page.
We will just search the database with our earlier example, Pho5, and see what information can be found in the Swiss-Prot entry.

Step 2. Type Pho5 in the search bar and click on Search. There are 21 matches to the Pho5 query (possibly more by the time this practical is done). The results are displayed as in Figure 2. They are retrieved from both the TrEMBL and Swiss-Prot sections. The star symbol shows which section an entry belongs to (see Figure 2, red label): entries with a yellow star belong to the Swiss-Prot section, whereas those without it belong to the TrEMBL section. There are several ways to retrieve just the reviewed (Swiss-Prot) entries.
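The same search can also be written down as a URL, which is handy when you want to repeat it later. The sketch below only builds such a URL; it assumes the UniProt REST endpoint https://rest.uniprot.org/uniprotkb/search and its `reviewed:true` query filter, neither of which is part of this practical. The function name and the chosen columns are illustrative.

```python
from urllib.parse import urlencode

# Assumption: UniProt REST search endpoint (the practical itself uses the web page only).
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_search_url(term, reviewed_only=False, fmt="tsv", size=25):
    """Return a UniProtKB search URL for `term`.

    reviewed_only=True restricts the hits to Swiss-Prot (reviewed) entries,
    which corresponds to the "Show only reviewed" filter on the result page.
    """
    query = term
    if reviewed_only:
        query = f"({term}) AND (reviewed:true)"
    params = {"query": query, "format": fmt, "size": size}
    if fmt == "tsv":
        # Columns for the tabular output; an illustrative selection.
        params["fields"] = "accession,reviewed,id,protein_name,organism_name"
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # All hits for Pho5 as a table, then only the reviewed ones as FASTA.
    print(build_search_url("Pho5"))
    print(build_search_url("Pho5", reviewed_only=True, fmt="fasta"))
```

Pasting either printed URL into the browser should return the corresponding result list, provided the service still accepts these parameters.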

Figure 1. The UniProt main page. The most important components are labelled in red: the available tools, the search bar, the news feed, the video tutorial, an overview of what can be found here, and the detailed help.

Figure 2. The UniProt result page. Labelled in red is the section status bar, labelled in green are section filter and other filters, and labelled in blue is the Download link.

If you need only the reviewed entries, perform the search and then click the Show only reviewed link on the result page (Figure 2, green label). Another way to retrieve just the Swiss-Prot entries is to use the Advanced search settings. The retrieved entries can be downloaded in several formats: select the entries you want, press the Download link (Figure 2, labelled blue) and choose the output format. For example, you can download the list of accession numbers or the sequences of the chosen entries in FASTA format.

Step 3. Among the retrieved entries, find the Pho5 protein (repressible acid phosphatase) from the organism Saccharomyces cerevisiae and click on its Entry link. The chosen entry is displayed. We will now have a detailed look at the Swiss-Prot entry.

There are ten different sections in the Swiss-Prot entry. Each section has one or more subsections that contain entry-specific information. Because of the great number of sections and subsections, one can get lost in the abundance of information. For that reason, clicking on the name of a section or subsection opens a help window with its description. We will now briefly look at the sections of the Swiss-Prot entry to get a general picture of the information the database holds about the protein. The first three sections are “Names and origin”, “Protein attributes” and “General annotation (Comments)”, shown in Figure 3. They contain basic information about the protein.

Figure 3. The sections “Names and origin”, “Protein attributes”, and “General annotation (comments)” for the Pho5 protein

Figure 4. The “Ontologies” section

Figure 5. The “Sequence annotation (Features)” section

The “Names and origin” section, as its name says, contains the name of the protein with its alternative names, including the E.C. number (recall the Biochemistry I class for the definition of the E.C. number). It also includes the name of the gene coding for the protein, as well as the organism of origin with its taxonomic lineage. Note that each taxonomic class links to the UniProt Taxonomy database. You can access the taxonomic data in the UniProt Taxonomy database and in the NCBI Taxonomy database using the NCBI unique identifier (taxonomic identifier).

The following section, “Protein attributes”, contains some useful information about the entry. In the case of Pho5, you can see that the protein is 467 aa long, that it was completely sequenced, and that its existence was confirmed by one or more analytical methods. You can also see that this protein is subject to post-translational modification.

The third section, “General annotation (Comments)”, generally contains biochemically important information, for example the reaction catalysed by the enzyme, post-translational modifications, the cell compartment, etc. Note that there are yellow links at the end of certain subsections. These links, which look like Ref. [0-9], point to the “References” section, where the publications supporting the information can be retrieved.

Figure 6. The “Sequence” section

The next section is the “Ontologies” section (Figure 4). In information science, an ontology is generally a set of concepts within a domain together with the relationships between those concepts. It can be used to reason about the entities within that domain and to describe the domain. An example is the Gene Ontology. The Gene Ontology project is a major bioinformatics initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases. The project provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data from GO Consortium members, as well as tools to access and process these data. In this section you can find the keywords for specific fields, as well as the terms from the Gene Ontology project database.

The section “Sequence annotation” (Figure 5) provides a precise but simple means for the annotation of sequence data. It describes regions or sites of interest in the protein sequence. In general, this section lists post-translational modifications, binding sites, enzyme active sites, local secondary structure and other characteristics reported in the cited references. Sequence conflicts between references are also included in this section.

The section “Sequence” (Figure 6) displays by default the canonical protein sequence and, upon request, all isoforms described in the entry. It also includes information pertinent to the sequence(s), including length and molecular weight. The protein sequence displayed by default is the one to which all positional annotation of the “Sequence annotation” section refers. It is called the ‘canonical’ sequence. Note that this section also includes various tools that can be used to analyse the sequence (e.g. BLAST, Compute pI, MW, etc.). The sequence can easily be exported in FASTA format by clicking on the FASTA link.

There are a few more sections. The “References” section, as mentioned before, contains the list of publications from which the information for the annotation was retrieved, together with links to those publications. Each reference states which information was taken from that publication. The “Cross-reference” section provides links and unique identifiers that point to collections and/or databases other than UniProtKB.
Finally, there are the sections “Entry information” and “Relevant documents”. The “Entry information” section contains information such as the entry submission date, the date of the last modification, the accession number, etc. The “Relevant documents” section contains links to documents relevant to the entry (for example genome sequence and annotation, protein family, etc.).
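The “Sequence” section reports the length and molecular weight of the canonical sequence and offers a FASTA export. As a rough sketch of how these two numbers relate to the sequence, the code below reads one FASTA record and sums average residue masses. The mass table uses commonly tabulated average values and is our addition, not taken from UniProt; the record in the demo is a hypothetical toy peptide, not Pho5. The result may differ slightly from the value shown on the entry page.

```python
# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528  # one water is added back for the free N- and C-termini


def parse_fasta(text):
    """Return (header, sequence) for the first record of a FASTA string."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; stop reading
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header found")
    return header, "".join(chunks).upper()


def molecular_weight(sequence):
    """Approximate average mass of an unmodified peptide chain."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


if __name__ == "__main__":
    # Hypothetical toy record, used only to show the calls.
    record = ">toy_peptide illustrative example\nMKT\nGG\n"
    name, seq = parse_fasta(record)
    print(name, len(seq), "aa", round(molecular_weight(seq), 2), "Da")
```

Post-translational modifications, such as those listed for Pho5, are not taken into account by this calculation.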

2.2. KEGG

The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database resource that integrates genomic, chemical and systemic functional information. In particular, gene catalogs from completely sequenced genomes are linked to higher-level systemic functions of the cell, the organism and the ecosystem. Major efforts have been undertaken to manually create a knowledge base for such systemic functions by capturing and organizing experimental knowledge in computable forms, namely KEGG pathway maps, BRITE functional hierarchies and KEGG modules. Continuous efforts have also been made to develop and improve the cross-species annotation procedure for linking genomes to the molecular networks through the KEGG Orthology system.

Step 1. Enter http://www.genome.jp/kegg/ in your browser's address bar, or just click on the link. The KEGG main page shows up (Figure 1). KEGG offers a wide variety of options and information. In this practical we will consider only the part of KEGG that is likely to be of most interest. KEGG contains a database of metabolic pathways from a wide variety of organisms, which can be explored.

Figure 1. The KEGG main page

Step 2. Click on the KEGG PATHWAY link on the KEGG main page, then in the section 0. Global map click the Metabolic pathways link. A map like the one in Figure 2 appears. It represents the reference metabolic pathway. The dots on the map represent metabolites, whereas the lines represent metabolic reactions. When you point the mouse cursor at a dot, the unique identifier of the compound is shown together with its picture (Figure 3a). Clicking on the dot opens the entry of that metabolite. When you point at a line, the identifiers related to that reaction are shown. These identifiers refer to the enzymes catalysing the reaction (EC numbers), the orthologous genes for that reaction and the reaction itself (Figure 3b). Clicking on the line representing the reaction lists the entries involved in this reaction, which are the entries behind the identifiers just mentioned.
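The identifiers shown on the map can be told apart by their form. The following sketch sorts them by kind, assuming the usual KEGG conventions: compounds written as C followed by five digits, reactions as R followed by five digits, orthology groups as K followed by five digits, and enzymes as four-part EC numbers. The function name is illustrative, and the patterns are a simplification that ignores other identifier types.

```python
import re

# Simplified patterns for the identifier kinds mentioned in Step 2.
PATTERNS = {
    "compound": re.compile(r"C\d{5}"),
    "reaction": re.compile(r"R\d{5}"),
    "orthology": re.compile(r"K\d{5}"),
    # EC numbers may be incomplete, e.g. 3.1.3.- for a partly classified enzyme.
    "enzyme": re.compile(r"(EC\s*)?\d+\.(\d+|-)\.(\d+|-)\.(\d+|-)"),
}


def classify_identifier(identifier):
    """Return the kind of a KEGG-style identifier, or 'unknown'."""
    text = identifier.strip()
    for kind, pattern in PATTERNS.items():
        if pattern.fullmatch(text):
            return kind
    return "unknown"


if __name__ == "__main__":
    for example in ["C00001", "R00001", "K00001", "EC 3.1.3.2", "sce00740"]:
        print(example, "->", classify_identifier(example))
```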

Figure 2. The global metabolic pathway

Step 3. In the drop-down menu find the organism Rickettsia prowazekii, select it and click on the Go button. A map slightly different from Figure 2 is displayed (Figure 4). Rickettsia prowazekii is an obligate intracellular parasitic, aerobic bacterium that is the etiologic agent of epidemic typhus, transmitted in the feces of lice. Because it is an obligate parasite, it lacks many metabolic pathways. The grey dots and lines in Figure 4 represent the metabolites and reactions that this bacterium lacks.
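The idea behind the grey elements can be written as a comparison of two sets: everything on the reference map that is absent from the organism is drawn grey. The sketch below is only an illustration of that idea; the function name and the identifiers in the demo are hypothetical placeholders, not real KEGG reactions, and it does not describe how KEGG itself renders the map.

```python
def split_reference_map(reference, organism):
    """Split reference-map elements into (present, missing) for one organism.

    Both results are sorted lists; the missing elements correspond to the
    grey dots and lines on the organism-specific map.
    """
    reference, organism = set(reference), set(organism)
    present = sorted(reference & organism)
    missing = sorted(reference - organism)
    return present, missing


if __name__ == "__main__":
    # Hypothetical placeholder IDs, not real KEGG reactions.
    reference_map = {"rxn_1", "rxn_2", "rxn_3", "rxn_4"}
    parasite = {"rxn_2", "rxn_4"}
    present, missing = split_reference_map(reference_map, parasite)
    print("coloured:", present)
    print("grey:", missing)
```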

Figure 3. a) The metabolite pointed at, marked in red, and b) the unique identifiers of the entries involved in the reaction marked in red

Now that we have seen how metabolic pathways can be retrieved and explored for a specific organism, it is time to see how genes and/or proteins are represented in KEGG. First we have to return to the KEGG main page.

Figure 4. Rickettsia prowazekii metabolic pathway

Step 4. On the KEGG main page enter Pho5 in the search bar, and then click on the sce:YBR093C link. This link leads to the Saccharomyces cerevisiae PHO5 gene. PHO5 genes from other organisms were retrieved as well, but as you already know, we are interested particularly in PHO5 from Saccharomyces cerevisiae. The page is displayed as in Figure 5.
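The keyword search and the entry page of Step 4 can also be addressed directly by URL. The sketch below assumes the KEGG REST service at https://rest.kegg.jp with its `find` and `get` operations, which this practical does not use; the helper names are illustrative. It only builds the URLs and does not contact the server.

```python
from urllib.parse import quote

# Assumption: KEGG REST service (the practical itself uses the web pages only).
KEGG_REST = "https://rest.kegg.jp"


def kegg_find_url(database, keyword):
    """URL for a keyword search in one KEGG database, e.g. 'genes'."""
    return f"{KEGG_REST}/find/{quote(database)}/{quote(keyword)}"


def kegg_get_url(entry_id, option=None):
    """URL for one entry; option may be e.g. 'aaseq' or 'ntseq' for sequences."""
    # Keep the colon of identifiers such as sce:YBR093C unescaped.
    url = f"{KEGG_REST}/get/{quote(entry_id, safe=':')}"
    if option:
        url += f"/{option}"
    return url


if __name__ == "__main__":
    print(kegg_find_url("genes", "Pho5"))       # search, as in the search bar
    print(kegg_get_url("sce:YBR093C"))          # the yeast PHO5 entry
    print(kegg_get_url("sce:YBR093C", "ntseq")) # its nucleotide sequence
```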

As you can see, the entry looks simple, but it is also quite rich because it contains a lot of links and references. It contains “standard” sections, such as gene name, definition with EC number, organism, references to other databases, protein sequence, nucleotide sequence, position, etc. We will focus on the “Pathway” section. Another interesting section, although not the focus of the present exercise, is the “Orthology” section. The “Nucleotide” section is also interesting: not only does it contain the DNA sequence, which can be exported in FASTA format, but it also offers an option to retrieve, in addition to the gene sequence itself, n nucleotides upstream or downstream of the gene.
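Retrieving a gene together with n nucleotides upstream or downstream comes down to widening the coordinate window on the chromosome, with one subtlety: for a gene on the reverse strand, "upstream" lies to the right on the forward strand. The sketch below illustrates this under our own assumptions (1-based inclusive coordinates, a plain string as the chromosome); the function names and the toy chromosome are hypothetical, and this is not KEGG's implementation.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def gene_with_flanks(chromosome, start, end, strand="+", upstream=0, downstream=0):
    """Return the gene plus flanking nucleotides, read 5' to 3' on its own strand.

    start and end are 1-based inclusive coordinates on the forward strand.
    Flanks are clipped at the chromosome ends.
    """
    if strand == "+":
        left, right = upstream, downstream
    else:
        # On the reverse strand the upstream region lies to the right.
        left, right = downstream, upstream
    lo = max(0, start - 1 - left)
    hi = min(len(chromosome), end + right)
    region = chromosome[lo:hi]
    return region if strand == "+" else reverse_complement(region)


if __name__ == "__main__":
    toy_chromosome = "AAACCCGGGTTT"  # hypothetical sequence, gene at 4..6
    print(gene_with_flanks(toy_chromosome, 4, 6, "+", upstream=2, downstream=1))
    print(gene_with_flanks(toy_chromosome, 4, 6, "-", upstream=2, downstream=1))
```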

Figure 5. The sce:YBR093C entry

Step 5. Click on the sce00740 link in the “Pathway” section. The “Pathway” section contains information about the function of the gene. The product of a gene can take part in metabolism, in the regulation of metabolic processes, or in the regulation of gene expression. The product of the PHO5 gene takes part in riboflavin metabolism. The page we have reached in Step 5 shows the riboflavin metabolic pathway. The product of the PHO5 gene is marked in red (Figure 6). This interactive map has the same functionality as the global metabolic pathway map, but because less data is shown, it is more detailed: the names of the corresponding metabolites are written next to the dots, and the EC numbers of the enzymes catalysing the reactions are placed on the reaction lines. If the enzyme is available in KEGG, the EC number links to this enzyme from the organism concerned. Available enzymes are highlighted in green. By clicking on a dot representing a metabolite, you can open the entry of that metabolite. Now we will return to the YBR093C entry to analyse it further.
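The content of the “Pathway” section can be thought of as a list of gene-to-pathway links. As a minimal sketch, the code below parses such links from tab-separated text in the two-column style returned by the `link` operation of the KEGG REST service (an assumption, since the practical only uses the web pages). The sample text is hand-written from the two pathways named in this practical and is not real server output.

```python
def parse_kegg_link(text):
    """Map each source ID to the list of target IDs it is linked to.

    Expects one tab-separated pair per line, e.g. 'gene<TAB>path:map'.
    """
    links = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        source, target = line.split("\t")
        links.setdefault(source, []).append(target)
    return links


# Hand-written sample based on the pathways mentioned in the text.
SAMPLE = "sce:YBR093C\tpath:sce00740\nsce:YBR093C\tpath:sce04111\n"

if __name__ == "__main__":
    for gene, pathways in parse_kegg_link(SAMPLE).items():
        print(gene, "->", ", ".join(pathways))
```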

Figure 6. The riboflavin metabolism pathway. Pho5 is labelled in red.

Step 6. On the YBR093C entry page, click on the sce04111 link located in the “Pathway” section. A portion of the yeast cell cycle regulation pathway is shown (Figure 7). This map shows which proteins are involved in the yeast cell cycle. There are proteins that regulate the activity of other proteins, transcription factors, etc. Some proteins are clustered together, which means that they form complexes with each other. As on the metabolic pathway map, Pho5 is marked in red, and the available genes are highlighted in green; their entries can be retrieved by clicking on them. The lines represent the influence of one protein on another. There are specific line types indicating repression, activation, phosphorylation, etc. Because the lines do not represent reactions in the usual sense of the word, they do not link to anything.

Figure 7. The yeast cell cycle. Pho5 is labelled in red.