Conception and implementation of a MySQL based analysis environment for tissue specific alternative splicing
Ralf H. Bortfeldt, Heike Pospisil, Alexander Herrmann, Jens Reich
Max-Delbrück-Center for Molecular Medicine (MDC), Department of Bioinformatics, Robert-Rössle-Str. 10, D-13125 Berlin
{ralf.bortfeldt,pospisil,alexander.herrmann,reich}@mdc-berlin.de
Alternative splicing. Tissue specific splicing. Computational sequence analysis. MySQL database.
Introduction
We report here the establishment of an analysis tool for tissue-specific alternative splicing. Splicing, the mechanism that removes intronic sequence from protein-coding genes in higher eukaryotes, has long been known [1]. The concept of alternative splicing as a mechanism for creating a high diversity of functional proteins in mammals has found increasing evidence and support with the progress of the Human Genome Project [2]. Investigations based on human sequence material (experimental data) and computational methods suggest that 35-59% of the identified genes are alternatively spliced in conjunction with cellular processes [2,3,4]. Various approaches to the computational detection of alternative splice forms (ASFs) are based on expressed sequence tags (ESTs), which at roughly 70% make up the largest sequence fraction in GenBank and are the major resource for new gene discoveries [5]. The completion of the human genome has shifted scientific interest progressively towards the proteome, and the detection of ASFs is an important step in investigating protein diversity. In this context ESTs provide a window onto spliced pre-mRNAs and thus onto a transcribed gene, allowing predictions about gene expression under specific conditions. Determining the conditions under which an mRNA was isolated requires collecting and classifying information belonging to the corresponding ESTs, e.g. the tissue of origin, the developmental stage or an association with diseases such as cancer. Based on an algorithm developed at the MDC and previously reported in the context of the IGMS (Integrated Genetic Map Service) [6,7], a set of alternatively spliced ESTs is predicted for the present set of mRNAs in GenBank and is accessible in the newly developed database EASED (Extended Alternative Spliced EST Database) [8].
Methods
The present work provides another module for EASED with extended information about the tissue type and anatomical origin of the alternatively spliced ESTs (AS-ESTs). An extensive computational screening of the GenBank entries of the identified AS-ESTs yielded a finite number of redundant and non-uniform annotations describing the histoanatomical origin of these ESTs. The data are stored in the “tissue” module of EASED, which is implemented in MySQL and contains various core tables for the nine best investigated model organisms: Arabidopsis thaliana, Bos taurus, Caenorhabditis elegans, Drosophila melanogaster, Danio rerio, Homo sapiens, Mus musculus, Rattus norvegicus and Xenopus laevis. These tables provide the current set of AS-ESTs with their tissue annotation, the numbers of mRNAs and AS-ESTs found for each tissue, and the number of tissues represented by the AS-ESTs matching a given mRNA. Taken together, the database module offers a flexible environment for different queries on the distribution of AS-ESTs over particular tissues and among the nine model organisms. Starting from the core tables, further investigations were performed to test the analysis environment. A revision of the annotations resulted in a considerably reduced number of super classes or categories, which group redundant annotations according to wording and tissue functionality. The number of AS events in each tissue category could then be used to calculate the frequency of ASFs among the total number of ESTs found in that category. For human and mouse, “cancerous” tissues were separated into sub-tables by using a set of keywords to find AS-ESTs whose annotations refer to cancer. With the flexible MySQL environment it is easily possible to select mRNAs with a high coverage of matching ASFs combined with a low spread of these ESTs over different tissues.
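The selection described at the end of the Methods (high coverage of matching ASFs, low spread over tissues) can be sketched as a grouped query. The following is only an illustration under stated assumptions: the table `as_est`, its columns `mrna_acc`, `est_acc` and `tissue`, the function name and the toy rows are all hypothetical and are not taken from EASED. Python's built-in sqlite3 module stands in for MySQL so that the sketch is self-contained; the SELECT statement uses only standard GROUP BY / HAVING syntax.

```python
import sqlite3

# Hypothetical stand-in for a core table of the "tissue" module:
# one row per AS-EST, with the mRNA it matches and its tissue annotation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE as_est (mrna_acc TEXT, est_acc TEXT, tissue TEXT)")

# Toy rows, invented purely for illustration (not real accessions or results).
toy_rows = [
    ("mRNA_A", "EST_1", "brain"),
    ("mRNA_A", "EST_2", "brain"),
    ("mRNA_A", "EST_3", "brain"),
    ("mRNA_B", "EST_4", "brain"),
    ("mRNA_B", "EST_5", "liver"),
    ("mRNA_B", "EST_6", "kidney"),
    ("mRNA_C", "EST_7", "liver"),
]
conn.executemany("INSERT INTO as_est VALUES (?, ?, ?)", toy_rows)


def select_specific_mrnas(connection, min_as_ests, max_tissues):
    """Return (mrna_acc, n_as_ests, n_tissues) for mRNAs that have at least
    `min_as_ests` matching AS-ESTs spread over at most `max_tissues` tissues."""
    query = """
        SELECT mrna_acc,
               COUNT(*)               AS n_as_ests,
               COUNT(DISTINCT tissue) AS n_tissues
        FROM as_est
        WHERE tissue IS NOT NULL
        GROUP BY mrna_acc
        HAVING COUNT(*) >= ? AND COUNT(DISTINCT tissue) <= ?
        ORDER BY n_as_ests DESC, mrna_acc
    """
    return connection.execute(query, (min_as_ests, max_tissues)).fetchall()


# mRNAs with at least three AS-ESTs that all come from a single tissue.
candidates = select_specific_mrnas(conn, min_as_ests=3, max_tissues=1)
print(candidates)
```

Both thresholds are free parameters of the sketch; the source does not state which cut-offs were used.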
In order to take into account these numbers as well as the fluctuating total EST coverage of each participating tissue, a scoring measure ηAS was defined. This measure is based on Shannon’s “entropy of probabilities” [9] and describes whether the ASFs matching a certain mRNA represent many or only a few tissues. The entropy ηAS thus allows mRNAs to be ranked according to the tissue specificity conveyed by their matching alternative splice forms. Similarly, the entropy ηNAS for non-alternatively spliced ESTs was calculated (human only). The difference between both measures additionally takes into account the tissue specificity of the constitutively spliced ESTs matching the same mRNA.
\[
\eta_{AS}(\Psi) = -\sum_{i=1}^{n} \frac{AS_i(\Psi)}{AS(\Psi)} \cdot \ln \frac{AS_i(\Psi)}{AS(\Psi)}
\]
Fig. 1
ηAS = entropy of AS incidence
Ψ = mRNA, subject of EST alignment
n = number of represented tissues
ASi = frequency of alternatively spliced ESTs in tissue i
AS = total frequency of ESTs matching mRNA Ψ

\[
\Delta\eta = \eta_{AS} - \eta_{NAS}
\]
Fig. 2
∆η = measure to evaluate the relationship between the tissue specificity of ASFs and NASFs matching the same mRNA
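A minimal numerical sketch of the two measures follows. The function names are hypothetical, and the choice of denominator is an assumption: by default the per-tissue values are normalised by their own sum, so that the terms form a probability distribution. The `total` argument allows a different value of AS to be supplied if the definition is read differently.

```python
import math


def tissue_entropy(per_tissue, total=None):
    """Entropy -sum_i (AS_i / AS) * ln(AS_i / AS) for one mRNA.

    per_tissue: counts or frequencies AS_i, one per represented tissue.
    total:      the denominator AS. ASSUMPTION: if omitted, the sum of the
                per-tissue values is used.
    Zero entries are skipped (the usual convention 0 * ln 0 = 0).
    """
    values = [v for v in per_tissue if v > 0]
    if not values:
        return 0.0
    denom = sum(values) if total is None else total
    return -sum((v / denom) * math.log(v / denom) for v in values)


def delta_eta(as_per_tissue, nas_per_tissue):
    """Difference eta_AS - eta_NAS for ESTs matching the same mRNA."""
    return tissue_entropy(as_per_tissue) - tissue_entropy(nas_per_tissue)


# Toy numbers, invented for illustration only:
# all AS-ESTs come from one tissue, non-AS ESTs are spread evenly over four.
eta_as = tissue_entropy([5])
eta_nas = tissue_entropy([2, 2, 2, 2])
print(eta_as, eta_nas, delta_eta([5], [2, 2, 2, 2]))
```

Under the default normalisation, an mRNA whose ASFs all come from a single tissue has an entropy of zero, and an even spread over n tissues gives ln n, so the toy case above yields a negative ∆η.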
Results and Discussion
Tests of our database module led to several interesting findings. The frequency of ASFs shows no trend with the total number of ESTs in the corresponding tissue category; thus the coverage does not influence the predicted number of ASFs in the established tissue categories. This observation was taken as the basis for comparing the AS frequency among different tissue categories. For example, in human tissues annotated as normal breast tissue an ASF frequency of 0.5% was found, whereas in human cancerous breast tissue the frequency was 2.0%, about four times as high. We take this as evidence for increased alternative splicing in the cancerous form of breast tissue, which may be explained by the increased requirement for proteins in states of high proliferation. Other organisms such as D. rerio or B. taurus showed a much lower frequency of AS events than human, but also clear differences between tissue categories. For example, “gastrointestinal tissue” of cow displayed a more than four times higher AS frequency than ESTs from “kidney” or “brain”, but at 0.2% it ranges far below the frequencies found in human. Here the differences in the magnitude of the AS frequency may be explained by the generally lower EST coverage in cow or zebrafish compared to human or mouse. For a considerable number of ESTs no tissue annotation existed or could be extracted from their GenBank entries, which decreases the resolution of the tissue distribution of ASFs. One mRNA (AY032677) was selected by a query requiring that its matching ASFs represent only one tissue annotation. Although this mRNA had a much lower coverage of AS-ESTs and non-AS-ESTs, the entropy measure expresses exactly what is expected: a low ∆η, because all the ASFs originate from the same tissue whereas the non-alternatively spliced ESTs matching the same mRNA are found in a variety of other tissues.
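The frequency comparison between tissue categories amounts to simple arithmetic, sketched here with hypothetical function names. The percentages 0.5% and 2.0% are the values reported above for normal and cancerous human breast tissue; the EST counts in the first call are invented solely to show the calculation.

```python
def as_frequency_percent(n_as_ests, n_total_ests):
    """Frequency of AS-ESTs among all ESTs of a tissue category, in percent."""
    if n_total_ests <= 0:
        raise ValueError("a tissue category needs at least one EST")
    return 100.0 * n_as_ests / n_total_ests


def fold_difference(freq_a, freq_b):
    """How many times higher freq_a is than freq_b."""
    if freq_b <= 0:
        raise ValueError("reference frequency must be positive")
    return freq_a / freq_b


# Invented counts, for illustration only: 5 AS-ESTs among 1000 ESTs.
print(as_frequency_percent(5, 1000))

# Reported frequencies: cancerous (2.0%) versus normal (0.5%) breast tissue.
print(fold_difference(2.0, 0.5))
```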
A further analysis involving additional, confirmed information from Ensembl [10] and SMART [11] revealed that the alternative splicing of the candidate mRNA AY032677 probably affects its gene product (DNA Polymerase Theta) negatively, because two skipped exons are situated in the region of a functional domain (POLAc). The entropy η also promises to be a reasonable measure for future investigations that assess the alternative splicing of genes in different tissues, because it considers three crucial quantities for a gene-representing mRNA: the number of matching ASFs, the number of tissues represented by these matching ASFs and the total number of matching ESTs.
References
[1] Knippers, R. Molekulare Genetik. 8. Auflage, Thieme Verlag Stuttgart (2001).
[2] The International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001).
[3] Brett, D. et al. EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Letters 474, 83-86 (2000).
[4] Mironov, A. A. et al. Frequent Alternative Splicing of Human Genes. Genome Research 9, 1288-1293 (1999).
[5] Benson, D. A. et al. GenBank. Nucleic Acids Research 30, 17-20 (2002).
[6] Pospisil, H. et al. A database on alternative splice forms on the Integrated Genetic Map Service (IGMS). In Silico Biology 3 (2002).
[7] http://www..bioinf.mdc-berlin.de/igms
[8] http://www..bioinf.mdc-berlin.de/splice/db
[9] Shannon. “A Mathematical Theory of Communication” (1948).
[10] Hubbard, T. et al. The Ensembl genome database project. Nucleic Acids Research 30, 38-41 (2002).
[11] Letunic et al. Recent improvements of the SMART domain-based sequence annotation resource. Nucleic Acids Research 30, 242-244 (2002).