Journal of Bioinformatics and Computational Biology © Imperial College Press
A WORKFLOW FOR MUTATION EXTRACTION AND STRUCTURE ANNOTATION KANAGASABAI RAJARAMAN* Department of Data Mining, Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore 119613.
[email protected] KHAR HENG CHOO† Department of Data Mining, Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore 119613.
[email protected] Department Of Biochemistry, Yong Loo Lin School Of Medicine, National University Of Singapore, 8 Medical Drive, Singapore 117597. SHOBA RANGANATHAN Department of Chemistry and Biomolecular Sciences & Biotechnology Research Institute Macquarie University, NSW 2109, Australia.
[email protected] Department Of Biochemistry, Yong Loo Lin School Of Medicine, National University Of Singapore, 8 Medical Drive, Singapore 117597. CHRISTOPHER J. O. BAKER‡ Department of Data Mining, Institute for Infocomm Research 21 Heng Mui Keng Terrace, Singapore 119613.
[email protected]
Received (23rd July 2007) Revised (11th September 2007) Accepted (30th September 2007)

Rich information on point mutation studies is scattered across heterogeneous data sources. This paper presents an automated workflow for mining mutation annotations from full-text biomedical literature using natural language processing (NLP) techniques for subsequent reuse in protein structure annotation and visualization. This system, called mSTRAP (Mutation extraction and STRucture Annotation Pipeline), is designed for both information aggregation and subsequent brokerage of the mutation annotations. It facilitates the coordination of semantically related information from a series of text mining and sequence analysis steps into a formal ontology. The ontology is designed to support application-specific data management of sequence, structure and literature annotations, which are populated as instances of datatype properties. mSTRAPviz is a subsystem that facilitates the brokerage of structure information and the associated mutations for visualization. For mutated sequences without any corresponding structure available in the Protein Data Bank (PDB), an automated pipeline for homology modeling was developed to generate the theoretical model. With mSTRAP, we demonstrate a workable system which can facilitate automation of the workflow for the retrieval, extraction, processing and visualization of mutation annotations, a task which is well known to be tedious, time-consuming, complex and error-prone. The ontology and visualization tool are available at http://datam.i2r.a-star.edu.sg/mstrap.

Keywords: Text Mining; Natural Language Processing; Ontology; Homology Modeling; Visualization; Workflow.

* Joint first author. † Joint first author. ‡ Corresponding author.
1. Introduction
Mutations can be caused by transcription errors in the genetic material, deliberate cellular control, damage due to radiation or chemical mutagens, and by retroviral infection. Reported mutations are also introduced deliberately through rational design and directed evolution. A small percentage of mutations confer an advantage, although the majority tend to have neutral or deleterious impacts, leading in some cases to disruption of molecular function or biological processes and possibly to disease in higher organisms. Numerous studies have been conducted on mutations over the years, generating rich and abundant sources of information 1-8. Unfortunately, the information is scattered across a variety of resources, including public and private biomedical literature databases. To access this information the researcher must scour the disparate resources for relevant citations. In order to convert the information into actionable knowledge, the researcher has to map a mutation to its associated protein, which often requires specific expertise or domain knowledge. The task is further complicated by heterogeneous document formats and language styles, and by the slow adoption of a common, standardized vocabulary to describe mutations. As a result, integrating this information is time-consuming and susceptible to errors. The challenge is therefore to extract the information from the disparate raw-text literature and from existing structured sources such as mutation databases, and to present it as an organized and integrated view with minimal intervention from the users. Until recently this has been the mandate of manual curation projects 8, 9, which have resulted in the creation of domain-specific mutation databases. These now exist in their hundreds, despite their eventual obsolescence when funding for curation is not renewed. In such cases these repositories of legacy data are only updated by enthusiastic end users.
Running in parallel to this trend, a series of independent initiatives have emerged which focus on developing automated techniques to facilitate the extraction of mutation-specific annotations from texts for subsequent reuse in bioinformatic analyses. Through experimentation with machine learning, natural language processing, ontologies and other AI technologies, automated mutation extraction is becoming a more mature and realizable goal. The following initiatives have pioneered the use of text mining techniques to recognize, extract and ground mutation entities. In early work, MuteXt 10 focused on single-point mutations for G protein-coupled receptors and nuclear hormone receptors, reporting mutation extraction performance. Rebholz-Schuhmann et al. 11 focus primarily on the extraction of disease-related mutation-gene pairs. MutationFinder 12 reports a rule-based system for finding point mutation mentions in abstracts. Baker and Witte 13 discuss the Mutation Miner system and a mutation extraction-dependent task, namely mutated-protein structure annotation. Further to this, Gabdoulline et al.
report the customization of the ProSAT structure annotation tool 14 to support the upload of mutation annotations from selected fields in the BRENDA database or from a custom-designed XML format. While these systems address many relevant issues for this application domain, in developing mSTRAP we go beyond the description of a system architecture, mutation extraction or the evaluation of core components of such a mutation pipeline. We address data interoperability issues so that mutation annotations are made available in a semantically consistent format, as instances of a formal OWL-DL ontology. Furthermore, we introduce mSTRAPviz, which can receive inputs in this format to annotate native protein structures or homology models corresponding to the mutated sequences described in the literature. This paper is organized as follows: Section 2 describes the typical mutation extraction and structure annotation scenario and gives an overview of the ontology-centric workflow we have designed to support the information management needs of this application domain. Sections 3 and 4 detail the design of the mSTRAP ontology and the steps in its population. This depends critically on the coordination of the following technical steps: text mining; ontology instantiation, mining and refinement; protein sequence retrieval and mutation mapping; and protein structure retrieval. Section 5 illustrates the use of the populated ontology in homology modeling and structure visualization.

2. Task Description
When scanning PubMed 15 literature for information on a particular protein (or protein family), the protein engineer or structural biologist typically has a specific research question in mind. Of critical importance is the interpretation of new insights from experimentally introduced perturbations, such as point mutations, in the context of the researcher’s current understanding of a protein, its properties and its structure.
How does the newly reported mutation directly or indirectly affect the kinetics of protein-ligand binding or the conformation of an interaction interface? To embark upon such an investigation the user may, for example, record the mutation annotations in a paper and retrieve the sequence information from sequence databases such as GenBank 16 or SwissProt/UniProt 17 and, where available, structure information from the Protein Data Bank (PDB) 18. The user then identifies the positions that the mutant residues occupy on the structure by inspecting the structure file. Before proceeding with modeling the 3D structure of the mutated protein, the user may need to address issues such as the remapping of residue coordinates to accommodate targeting signals found in the sequence. Additionally, inconsistent sequence-structure information can occur in the PDB file 21. From this simplified use case, we observe the following: (a) multiple steps need to be carried out, (b) integration of data from various sources is required, (c) thorough checks are required to avoid propagating errors to subsequent steps and rendering the results unusable, and (d) many components in the process readily lend themselves to automation.
Figure 1: A schematic illustration of the mSTRAP system architecture
Here we describe a multi-tier system to automate the workflow required to support the defined scenario. The system is capable of fetching documents from PubMed on the basis of specific keywords and processing either the fetched documents or documents input directly by the user. The documents are parsed to extract the relevant annotations and mutation information. Information such as sequences and similarity search results is generated by third-party tools and stored as instances of the OWL-DL ontology 19. The ontology is later queried and the information undergoes application-specific data processing, in which the mutated residues are mapped to the structure before being displayed in a 3D visualization. In general, the system coordinates two main pipelines (see Figure 1): (i) the ontology population workflow, comprising document retrieval, information extraction, data integration and ontology instantiation; and (ii) the ontology employment workflow, comprising querying of the populated ontology, protein structure coordinate retrieval, homology modeling if the structure is unavailable, mutant residue mapping and protein structure visualization. Technically, it is feasible to integrate the two pipelines; however, we have kept them separate, as discussed in Section 6.
3. Ontology Design
Ontologies are vehicles of knowledge representation which conceptualize a domain in terms of concepts, relations and instances, as well as constraints on concepts. Such formalisms have increasingly been deployed in life science information systems, which depend on access to expressive representations of domain-specific knowledge 28. Some life science ontologies are designed to integrate information about
Figure 2: Conceptualization of the mSTRAP ontology (object properties in dark ovals, datatype properties in grey ovals) displayed using Growl (http://ecoinformatics.uvm.edu/dmaps/growl)
resources as well as domain knowledge 29, 30. The mSTRAP ontology was designed to facilitate data integration in our focused application domain and was scripted in the W3C-endorsed standard, the Web Ontology Language (OWL) 19, using TopBraid Composer (http://www.topbraidcomposer.com/). The conceptualization captures core concepts and relations (as object properties) relevant to mutation extraction from the literature and the corresponding protein sequence retrieval. These are sufficient to facilitate user navigation of the instance data provided by the mutation extraction step. More extensive information about the data aggregated by the workflow is captured in datatype properties. Figure 2 shows the full conceptualization, distinguishing object and datatype properties. The ontology, whilst being relatively simple and application-specific, anticipates the import of richer, more specific knowledge representations. For example, the Protein concept can be extended to import the Protein Phosphatase ontology 31, which describes families of proteins based on protein domain architectures. This could facilitate DL reasoning to assign text mining-derived mutation annotations to the appropriate Protein Phosphatase classes.

4. Ontology Population Workflow
In this section we describe the ontology population workflow, which comprises content acquisition, natural language processing, sequence retrieval, sequence analysis and an ontology instantiation strategy. Ontology instances are primarily generated by identifying data entities in a multi-step process, namely: (i) parsing full texts to extract protein
names and point mutations using an in-house text mining toolkit, called the BioText Suite (http://research.i2r.a-star.edu.sg/kanagasa/BioText/), which performs text processing tasks such as tokenization, part-of-speech tagging, named entity recognition, grounding and relation mining; (ii) instantiating the ontology with the entities extracted in the previous step; (iii) mining the ontology instances and re-instantiating them based on a cross-validation between protein sequences retrieved from an NCBI protein name search and the successful mapping of text-derived mutations onto the sequences; and (iv) retrieving protein structure metadata corresponding to the mutation-mapped protein sequences.

4.1 Text Mining
Content Acquisition: Our content acquisition engine takes user keywords and retrieves full-text research papers through a PubMed search, parsing the search results and crawling the publishers’ websites. Collections of research papers are converted from their original formats, e.g. PDF, to ASCII text and passed to the text mining system.

Named Entity Recognition: BioText processes the retrieved full-text documents and recognizes entities using a gazetteer component. The gazetteer uses term lists and regular patterns to match against the tokens of a processed text and tags the terms found. During gazetteer lookup, the ontology class of the term is also added to the document markup as an attribute. This is used later during the instantiation process to identify the right ontology class for population. The manually curated protein name list from Swiss-Prot was used for identifying proteins found in the literature and was further consolidated by combining all canonical names and synonyms. Point mutations were extracted using a simple regular expression of the form

[A-Z]([a-z][a-z])?\-?\d+[A-Z]([a-z][a-z])?    (1)
Some false positives are eliminated using two rules: (i) both the wild-type and mutant residues must match valid amino acid names or letters (e.g. O9X will be recognized as a false positive); (ii) the wild-type and mutant residues must be different. This approach is akin to, e.g., MutationFinder 12, which employs more complex regular expressions and rules to improve precision and recall. Our approach differs in that it is based on a high-recall step to extract point mutations from the full texts and relies on a novel ontology mining and refinement method (described in Section 4.3) to improve precision.

Normalization and Grounding: Entities recognized in the previous step are normalized and grounded to their ‘canonical names’ before instantiation. Protein names are normalized to the canonical name entry in Swiss-Prot and the grounding is done using the Swiss-Prot ID. Note that the normalization is well-defined because the gazetteer exclusively uses the Swiss-Prot curated name list for protein name recognition. Point mutations are normalized to the wNm format, where w is the one-letter wild-type residue, m is the one-letter mutant residue and N is the numeric residue position. This is done by a simple mapping from three-letter to one-letter amino acid codes for all mutations containing three-letter codes.
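The extraction, filtering and normalization steps described above can be illustrated with a minimal Python sketch. It is not the BioText implementation: the function names, the restriction to the 20 standard amino acids and the example sentence are assumptions made for illustration, and capture groups have been added to pattern (1) so that the three parts of a mutation can be read out.

```python
import re

# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_LETTER = set(THREE_TO_ONE.values())

# Pattern (1), with groups for wild-type residue, position and mutant residue.
MUTATION_RE = re.compile(r"([A-Z](?:[a-z][a-z])?)-?(\d+)([A-Z](?:[a-z][a-z])?)")


def to_one_letter(code):
    """Return the one-letter code for a residue token, or None if it is invalid."""
    if len(code) == 1:
        return code if code in ONE_LETTER else None
    return THREE_TO_ONE.get(code)


def extract_point_mutations(text):
    """Return normalized wNm mutations in order of first appearance."""
    found = []
    for wt_raw, position, mut_raw in MUTATION_RE.findall(text):
        wt, mut = to_one_letter(wt_raw), to_one_letter(mut_raw)
        if wt is None or mut is None:  # rule (i): not a valid residue
            continue
        if wt == mut:                  # rule (ii): residues must differ
            continue
        normalized = f"{wt}{position}{mut}"
        if normalized not in found:
            found.append(normalized)
    return found


# Made-up sentence used only to exercise the two rules.
sentence = "The D181A and Cys-215Ser mutants were inactive, unlike O9X or A12A."
print(extract_point_mutations(sentence))
```

In this sketch O9X is dropped by rule (i) and A12A by rule (ii), while Cys-215Ser is normalized to C215S.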
Figure 3: The ontology population and instantiation workflow.
Relation Detection: In this step we identify the Mutation-Protein relations between the normalized entities. We adopt a simple relation mining approach whereby two entities are said to be related if they co-occur in a sentence. Thus, every document is split into sentences and co-occurrence detection is invoked on each of them. Sentences that do not contain both a protein and a mutation are skipped. From the resulting collection, Mutation-Protein pairs are returned along with the respective sentences in which they co-occur.

4.2. Ontology Instantiation
The ontology is instantiated with the entities extracted in the previous steps as follows. The grounded entities are instantiated as class instances in the respective ontology classes (as tagged by the gazetteer), and the detected relations are instantiated as object property instances. We wrote a custom script using the Jena API (http://jena.sourceforge.net/) for this purpose.

4.3. Ontology Mining and Re-instantiation
To reduce the number of false positive Mutation-Protein pairs extracted by the co-occurrence based detection algorithm (see Section 4.1), we define a further subroutine, which uses the populated ontology in combination with the sequence retrieval and mutant mapping algorithm described below. Section 4.5 shows that this approach markedly improves precision in identifying the correct protein sequence reported in the original text.
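The sentence-level co-occurrence rule of Section 4.1 can be sketched as follows. This is a simplified illustration rather than the BioText relation miner: the naive sentence splitter, the substring matching and the protein names PhosA and PhosB are assumptions. The per-pair sentence counts it returns correspond to the sentence frequency used for ranking in Section 4.3.2.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cooccurring_pairs(text, protein_names, mutations):
    """Return (pair counts, evidence sentences) for (mutation, protein) pairs."""
    counts = Counter()
    evidence = {}
    for sentence in split_sentences(text):
        proteins_here = [p for p in protein_names if p in sentence]
        mutations_here = [m for m in mutations if m in sentence]
        if not proteins_here or not mutations_here:
            continue  # sentence lacks one of the two entity types
        for mutation in mutations_here:
            for protein in proteins_here:
                counts[(mutation, protein)] += 1
                evidence.setdefault((mutation, protein), []).append(sentence)
    return counts, evidence


# Made-up text with hypothetical protein names.
text = ("PhosA was purified. The D181A mutant of PhosA traps substrates. "
        "In PhosA, D181A abolished activity. D181A also binds PhosB. "
        "No mutation here.")
counts, evidence = cooccurring_pairs(text, ["PhosA", "PhosB"], ["D181A"])
print(dict(counts))
```

The pair (D181A, PhosB) illustrates the kind of spurious relation that co-occurrence alone cannot exclude, which motivates the refinement step.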
4.3.1 Protein Sequence Retrieval and Mutation Mapping
Given a protein and the point mutations associated with it, a query for the protein name is performed on the GenBank protein sequence database and FASTA-formatted sequences are retrieved using the Entrez Programming Utilities, EUtils (http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html), provided by NCBI. Subsequently, the mutations are mapped onto the retrieved sequences using an offset analysis. Let w1N1m1, w2N2m2, ..., wkNkmk be the k mutations to be mapped, sorted in ascending order of Ni. Offsets between adjacent mutations are calculated and the following regular expression is constructed:

w1.{f1}w2.{f2} ... w(k-1).{f(k-1)}wk    (2)

where fi = N(i+1) - Ni and '.' matches any single residue. We apply this regular expression to the FASTA-formatted sequences and return the matched sequences. This algorithm is coded as a subroutine called RetrieveAndMap( ).
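The offset analysis can be sketched in Python as below. This covers only the mapping half of RetrieveAndMap( ), with the sequence passed in directly instead of being fetched from GenBank; the function names and the toy sequence are assumptions. Note one reading we make explicit: when the pattern is written out literally, the wild-type residue at Ni itself occupies one position, so the sketch places N(i+1) - Ni - 1 wildcard characters between two anchors.

```python
import re


def build_offset_pattern(mutations):
    """Build a regex anchored on the wild-type residues of wNm mutations."""
    # Deduplicate and sort by residue position (ascending).
    anchors = sorted({(int(m[1:-1]), m[0]) for m in mutations})
    parts = [anchors[0][1]]
    for (prev_pos, _), (pos, wild_type) in zip(anchors, anchors[1:]):
        gap = pos - prev_pos - 1  # residues strictly between the two anchors
        if gap < 0:
            raise ValueError("conflicting wild-type residues at one position")
        parts.append(".{%d}" % gap)
        parts.append(wild_type)
    return "".join(parts)


def map_mutations(mutations, sequence):
    """Return numbering shifts at which all wild-type residues line up.

    A shift of 0 means the numbering in the text agrees with the sequence.
    """
    # A lookahead is used so that overlapping candidate alignments are found.
    pattern = re.compile("(?=(%s))" % build_offset_pattern(mutations))
    first_pos = min(int(m[1:-1]) for m in mutations)
    return [match.start() + 1 - first_pos for match in pattern.finditer(sequence)]


# Toy sequence (made up): K at position 2, D at 6, R at 10.
toy = "MKTAYDAKQRC"
print(build_offset_pattern(["D6N", "K2A", "R10E"]))
print(map_mutations(["K2A", "D6N", "R10E"], toy))
print(map_mutations(["K2A", "D6N", "R10E"], "GS" + toy))
```

Because only the relative spacing of the wild-type residues is tested, a sequence whose numbering is shifted (for example by a leading tag or signal peptide) still matches, and the shift is reported.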
4.3.2 Algorithm for Ontology Mining and Re-instantiation
This routine, called OntoMineAndRefine(), iterates over the following steps for each protein found in every document.
1. Select the mutants of the protein using the proteinMutantRel and mutantOccursIn object properties in the ontology.
2. Assign each Mutant-Protein pair a score equal to the sentence frequency of this pair in the document.
3. Pick the top two scoring Mutant-Protein pairs and call the subroutine RetrieveAndMap( ) with them as arguments.
4. If at least 1 hit is returned, include the next ranked Mutant-Protein pair(s) and call RetrieveAndMap( ) again.
5. Repeat Step 4 until no hits are returned or no more Mutant-Protein pairs are available.
6. In the former case, the last considered pair and all lower ranked pairs are marked as false positives and uninstantiated from the ontology together with all sentences in which they occur.
7. With the remaining mutants of the protein in the ontology, call RetrieveAndMap( ) to retrieve the top hit and instantiate it under the Sequence class.
8. Instantiate the object properties proteinSequenceRel and mutantSequenceRel to establish relationships with the corresponding protein and mutants. Create datatype property instances for the GenBank ID and the sequence content (FASTA).
Through an iterative ontology mining and sequence analysis, OntoMineAndRefine() filters out the spurious Mutation-Protein pairs identified during the relation detection and
later calls RetrieveAndMap() using the remaining instances to retrieve the protein sequences and instantiates them. Note that, unlike in 13, we do not require that the source organism be identified from the text.

4.4. Protein Structure Retrieval
For each protein sequence instantiated in the ontology, a BLAST 20 search is performed (using the NCBI blastp utility) to identify the closest sequence-structural homolog in the BLAST database of Protein Data Bank structure sequences. The details of the top 3 hits are populated in the ontology. For each instance of a structure, the PDB ID, protein structure name, e-value and % sequence identity are captured in datatype properties.

4.5. Population Performance Analysis
To evaluate the ontology population pipeline, it is desirable to validate on a large collection of documents from PubMed. However, this task requires manual curation of all documents to annotate mutations and proteins and build a corpus. For this paper, we have used as the gold standard the Protein Mutant Database (PMD) 9, which contains details of mutated protein sequences, their associated mutations and manually generated summaries of mutation impacts. We collected 47 PMD records reporting mutated 'phosphatases' that had at least 2 point mutations and a MEDLINE citation whose full-text paper was downloadable with our content acquisition engine. With this corpus, we evaluated two tasks: 1) extraction of Mutation-Protein pairs and 2) protein sequence retrieval. For both tasks we measured performance by computing precision and recall using the version of the ontology produced after the ontology mining and re-instantiation step.
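The ontology mining and re-instantiation step referred to here (OntoMineAndRefine(), Section 4.3.2) can be sketched for a single protein as follows. The sketch works on plain Python data rather than on ontology instances, and it makes several assumptions: ties in sentence frequency are broken alphabetically, a protein with a single mutation is passed through unchanged, and make_toy_retriever is a hypothetical stand-in for RetrieveAndMap( ) that checks exact positions in in-memory sequences.

```python
def refine_mutations(scored_mutations, retrieve_and_map):
    """Rank mutations by sentence frequency and prune those that break the mapping.

    scored_mutations: dict mapping a wNm mutation to its sentence frequency.
    retrieve_and_map: callable taking a list of mutations and returning the
        identifiers of sequences onto which all of them map.
    Returns (kept, rejected, top_hit_or_None).
    """
    ranked = sorted(scored_mutations, key=lambda m: (-scored_mutations[m], m))
    kept = ranked[:1]
    # Starting from the top two, add one lower-ranked mutation at a time
    # for as long as at least one sequence still accommodates all of them.
    for mutation in ranked[1:]:
        if not retrieve_and_map(kept + [mutation]):
            break
        kept.append(mutation)
    rejected = ranked[len(kept):]  # last considered pair and all below it
    hits = retrieve_and_map(kept) if kept else []
    return kept, rejected, (hits[0] if hits else None)


def make_toy_retriever(sequences):
    """Hypothetical stand-in for RetrieveAndMap(): exact positions, no offsets."""
    def retrieve(mutations):
        hits = []
        for seq_id, seq in sequences.items():
            if all(int(m[1:-1]) <= len(seq) and seq[int(m[1:-1]) - 1] == m[0]
                   for m in mutations):
                hits.append(seq_id)
        return hits
    return retrieve


# Made-up sequences and sentence frequencies.
retriever = make_toy_retriever({"seqA": "MKTAYDAKQRC", "seqB": "MKTWWWWWWWW"})
scores = {"K2A": 5, "D6N": 3, "W9F": 2, "R10E": 1}
kept, rejected, top_hit = refine_mutations(scores, retriever)
print(kept, rejected, top_hit)
```

In this toy run R10E is discarded even though it is consistent with seqA, because Step 6 removes every pair ranked below the first failure; under this reading the rule favours precision and may cost some recall.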
In Task 1, precision was defined as the percentage of true positives among the instantiated Mutation-Protein pairs and recall was defined as the percentage of true positives over all correct pairs, whereas in Task 2, precision was the percentage of true positives among the sequences instantiated and recall was the percentage of true positives over all correct sequences. True positives were identified with reference to the PMD records. The results are presented in Table 1. The experiments were carried out using a 3.6GHz Xeon Linux workstation with 4 processors and 8GB RAM.

Table 1: Performance of ontology population
Task                                  Precision   Recall
Task 1: Mutant extraction             83%         72%
Task 2: Protein sequence retrieval    78%         69%
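The precision and recall definitions used for Table 1 can be written as a short Python sketch. The pairs below are made up to illustrate the arithmetic (with hypothetical protein names) and are not taken from the PMD corpus.

```python
def precision_recall(instantiated, gold):
    """Return (precision, recall) as fractions for two collections of items."""
    instantiated, gold = set(instantiated), set(gold)
    true_positives = len(instantiated & gold)
    precision = true_positives / len(instantiated) if instantiated else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative data only: three instantiated pairs, four gold-standard pairs.
instantiated_pairs = {("D181A", "PhosA"), ("C215S", "PhosA"), ("Q262A", "PhosB")}
gold_pairs = {("D181A", "PhosA"), ("C215S", "PhosA"),
              ("R221K", "PhosA"), ("F182A", "PhosA")}
precision, recall = precision_recall(instantiated_pairs, gold_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three instantiated pairs are correct (precision 2/3) and two of the four gold-standard pairs are recovered (recall 1/2).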
We observed very good performance on the mutant extraction task (Task 1) indicating that our method of combining text mining and protein sequence analysis is effective. An evaluation of the same task before the ontology mining step yielded precision and recall both < 50%. As for the lower recall, we found in later investigation that the gazetteer failed to identify some protein names since they were either missing in
Figure 4: Ontology employment workflow.
Swiss-Prot or no rules were able to capture the name variant. However, rather surprisingly, the protein sequence retrieval performance was also high even though mutation extraction recall was not perfect. Our investigation revealed that this is because the number of sequence hits for the proteins found is generally of the order of hundreds or less, and the offset analysis is able to narrow these down effectively to the correct sequence. We would like to point out that the high performance obtained can only be treated as a preliminary result, since we used a rather small corpus. It will be interesting to conduct the same evaluation on a larger corpus built from PubMed documents, with more protein families. The precision and recall under this setup will depend on factors including the performance of protein name recognition, the offset analysis method used for mutation mapping and the iterative ontology mining algorithm, and this requires thorough study. Our current work focuses on this step.
5. Ontology Employment
Once the ontology is instantiated by the automated text-mining pipeline, we use a protein structure visualization program, mSTRAPviz, to query the ontology and display the annotations. The workflow consists of the following steps: (i) load and query the ontology to obtain the mutation annotations, sequence and structure information; (ii) if structure information is available, retrieve the protein structure coordinates from the PDB, otherwise select a
suitable template sequence for homology modeling from the BLAST result in the ontology to generate a theoretical model; (iii) map the mutant residues to the protein structure and color them before visualizing the protein structure with the mutation annotations (refer to Figure 4). Each of these steps is handled by the various modules within mSTRAPviz: an OntologyLoader, which loads the ontology file and stores the retrieved information in the application data structure; a simple browser, OntologyNavigator, which displays sequence names, mutations and corresponding sentences retrieved from the ontology, from which the end user chooses which proteins to visualize; a ClustalW_Handler, which interfaces between the alignment tool and mSTRAPviz; and a MODELLER_FileHandler, which interfaces between MODELLER and mSTRAPviz.
5.1 Querying the Ontology
The OntologyLoader uses the Protégé-OWL Application Programming Interface (API) (http://protege.stanford.edu/) to query the ontology to populate its data model and facilitate the transfer of mutation annotations to the protein structure for visualization. The OntologyLoader first queries the Sequence class to retrieve all the sequences that were instantiated. Sequences without any BLAST results are omitted. The sequence information, including the BLAST results (considered as homologous sequences), GenBank ID and FASTA-formatted amino acid sequence, is obtained using the hasBlastResult, hasGenBankID and hasFASTAFormat properties respectively (refer to Table 2). The hasBlastResult object property is queried to access details of the BLAST results. Actual data is obtained from the datatype properties hasResultID, hasResultEValue and hasResultRank. Mutant residues are queried via the Mutant class after the Sequence object is populated.

Table 2: Datatype property and object property under the Sequence class as represented in the ontology.
ID Name           Property type   Description
hasBlastResult    Object          Indicates the existence of BLAST results
hasGenBankID      Datatype        Stores the GenBank ID for the sequence
hasFASTAFormat    Datatype        Stores the FASTA-formatted sequence
hasResultID       Datatype        BLAST result ID
hasResultEValue   Datatype        The e-value for the BLAST result
hasResultRank     Datatype        The rank order as appeared in the BLAST result
Although the ontology contains most of the information required by mSTRAPviz, additional information which is specific to the application domain, such as protein structure coordinates, the template sequence and the signal peptide, has to be retrieved separately. Each sequence queried from the ontology is checked for the presence of a signal peptide by comparing the sequence against the sig_peptide field located in the GenBank entry. The mature protein does not contain the signal peptide, which therefore does not appear in the 3D structure; hence this short peptide may have to be removed, depending on the coordinate
information, and the coordinates for the mutant residues have to be adjusted 10. Other post-translational modifications are not addressed in this work, and sequences affected by them cannot currently be processed. Next, the relation mutantSequenceRel or sequenceMutantRel is checked to ascertain whether a mutation occurs in a sequence. If no such relation exists, the sequence is skipped. Where the mutation information is available, the wild-type and mutant residues are identified using hasWildtypeResidue and hasMutantResidue (see Table 3), which specify the 1-letter amino acid code of the residue. The coordinate or position information of the mutant residue is obtained using hasWildtypeXCoordinate, where X is the 1-letter code of any of the 20 amino acids. Simultaneously, the documents and the sentences that mention the mutation are queried and stored as Document and Sentence objects respectively using hasContent, hasTitle, hasJournal and hasAuthors, all of which are defined as datatype properties. Storing all this information completes the information retrieval from the ontology into the application data structure of mSTRAPviz.

Table 3: Datatype properties described in the Mutant class.
ID Name                  Description
mutantSequenceRel        The existence of mutation(s) in a sequence
sequenceMutantRel        The inverse representation of mutantSequenceRel
hasWildtypeResidue       A wildtype residue represented in 1-letter amino acid code
hasMutantResidue         The mutant residue represented in 1-letter amino acid code
hasWildtypeXCoordinate   A position in the sequence where a mutant residue occurs
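The coordinate adjustment for a signal peptide mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the mSTRAPviz code: it assumes that the mutations are numbered on the precursor sequence, that the signal peptide occupies the first signal_peptide_length residues (as a GenBank sig_peptide feature would indicate), and that the structure is numbered from the first residue of the mature chain. The function name and the example values are hypothetical.

```python
def to_mature_numbering(mutations, signal_peptide_length):
    """Shift wNm mutations from precursor numbering to mature-chain numbering."""
    adjusted = []
    for mutation in mutations:
        wild_type, position, mutant = mutation[0], int(mutation[1:-1]), mutation[-1]
        new_position = position - signal_peptide_length
        if new_position < 1:
            # The residue lies inside the signal peptide, so it has no
            # counterpart in the structure of the mature protein.
            continue
        adjusted.append(f"{wild_type}{new_position}{mutant}")
    return adjusted


# Made-up example: a 20-residue signal peptide.
print(to_mature_numbering(["A5G", "D30N", "K45R"], 20))
```

Whether the shift is subtracted, added or not needed at all depends on the numbering convention used in the source paper and in the structure file, which is why the check against the sequence itself remains important.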
Once the application data model is fully instantiated, a simple ontology browser, OntologyNavigator, presents the list of sequences and annotations to the user for selection. If a protein structure is available for the user’s selected sequence, the system proceeds to map the mutant residues to the structure and display it with the corresponding annotations. If no experimental protein structure exists for the selected sequence, homology modeling is employed to provide a theoretical model for that sequence. In addition to the default structure, which is the top hit returned from the BLAST search against the PDB, the system also accepts a PDB code entered manually by the user.

5.2 Automated Homology Modeling
In the text-mining pipeline, sequence similarity searches against the PDB are conducted and the top three hits are stored in the ontology for every sequence. In the mSTRAPviz application, we extract the best hit, defined as a hit with an e-value lower than 0.01, as a suitable candidate for homology modeling. Any sequences that do not meet this criterion are skipped. If a suitable template sequence (with its associated experimental structure), specified by the PDB ID in the ontology, exists, then mSTRAPviz retrieves the protein structure file from the PDB via the Web and relays it to the MODELLER_FileHandler module, which generates the necessary scripts for automating homology modeling. The nature of PDB files makes it particularly challenging to obtain an accurate mapping between the PDB ATOM records (3D coordinates, residues and atom information) and SEQRES records (residue information), as detailed in 21. Fortunately, MODELLER provides the facility (through the model_segment( ) subroutine) to extract the sequence information
directly from the PDB ATOM records in the PDB file. The target sequence (from the ontology) and the template sequence (extracted from the PDB file through MODELLER_FileHandler) are aligned in global alignment mode using CLUSTALW 22 with the default parameters. The MODELLER_FileHandler reads the aligned sequences and interfaces with MODELLER to perform homology modeling on the target-template alignment. Three homology models are generated and their quality is assessed using a statistical potential method called the Discrete Optimized Protein Energy (DOPE) 23, included in the MODELLER program. The model with the lowest DOPE score is considered the likeliest protein model for the given alignment and template. It must be noted that the DOPE score has an arbitrary scale and is not normalized with respect to protein size; hence it should be interpreted with caution.

5.3 Visualization with mSTRAPviz
By this stage, it is assumed that the protein structure coordinates have been obtained; the structure can be either an experimental protein structure from the PDB or a theoretical model generated using homology modeling. The mutated residues are identified, colored and mapped to the appropriate positions in the protein structure. Figure 5 shows a screenshot of mSTRAPviz. There are three main panels: (A) the protein structure panel, (B) the sequence panel and (C) the documents and sentences panel. The protein structure is displayed in the structure panel using the open source Java viewer Jmol (http://www.jmol.org), which is highly programmable and customizable. The underlying sequence is shown in the sequence panel. The mutant residues are highlighted in red on the sequence and labeled in color on the structure. Clicking on a mutant residue in the sequence panel sets the focus on that residue and displays the associated annotations, including the journal article and the sentences that mention the mutant residue.
Users are able to interact with the structure using the standard Jmol features.
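As a rough illustration of the decision rules in Sections 5.2 and 5.3, the following Python sketch selects a template by E-value, picks the model with the lowest DOPE score, and emits Jmol script commands for the mutated residues. It is not the mSTRAPviz implementation: the names BlastHit, pick_template, pick_model and jmol_highlight, the default chain identifier "A", and the exact Jmol commands are assumptions made for illustration. Only the 0.01 E-value cutoff, the lowest-DOPE rule and the red highlighting come from the text.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence

E_VALUE_CUTOFF = 0.01  # threshold stated in Section 5.2


@dataclass(frozen=True)
class BlastHit:
    """One stored BLAST hit against the PDB (hypothetical record layout)."""
    pdb_id: str
    e_value: float


def pick_template(hits: Sequence[BlastHit]) -> Optional[BlastHit]:
    """Return the lowest E-value hit below the cutoff, or None to skip the sequence."""
    eligible = [h for h in hits if h.e_value < E_VALUE_CUTOFF]
    return min(eligible, key=lambda h: h.e_value) if eligible else None


def pick_model(dope_scores: Mapping[str, float]) -> str:
    """Return the name of the model with the lowest DOPE score."""
    return min(dope_scores, key=lambda name: dope_scores[name])


def jmol_highlight(positions: Sequence[int], chain: str = "A") -> str:
    """Build Jmol script lines that color each mutated residue red and label
    its alpha carbon with residue name and number (assumed command choice)."""
    lines = []
    for pos in sorted(set(positions)):
        lines.append(
            f"select {pos}:{chain}; color red; "
            f"select {pos}:{chain}.CA; label %n%r;"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # All identifiers and numbers below are made-up placeholders.
    hits = [BlastHit("1abc", 1e-30), BlastHit("2xyz", 0.5)]
    template = pick_template(hits)
    print(template.pdb_id if template else "no template, sequence skipped")
    print(pick_model({"model1": -1200.0, "model2": -1500.5, "model3": -900.0}))
    print(jmol_highlight([45, 12]))
```

Because the DOPE score is not normalized for protein size, pick_model only compares models built for the same target, which matches the caution given in Section 5.2.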
6. Discussion

In this paper we present a workflow for mining point mutation annotations from full-text documents and making them visible in the 3D context of a protein structure. We make the ontology and the ontology-enabled visualization software freely available, and we provide a sample instantiated ontology containing annotations pertaining to mutated phosphatases. The infrastructure that we provide for this workflow applies contemporary techniques in a number of novel ways, and there are multiple avenues for contrasting and assessing the individual component technologies. Here we address the main novel contributions of this work. First, we customize an existing text mining application for use in the mutation extraction domain. We acknowledge that a number of systems have been reported that address the extraction of mutations mentioned in text. However, only a few systems 10, 24 describe the grounding of extracted mutations onto the corresponding sequences, and this is an important step when the use of extracted mutations and annotation sentences goes beyond their deposition in databases. In our context the goal is to present the information
Figure 5: mSTRAPviz for visualizing a protein structure together with its mutation and sentence annotations.
in a robust manner such that it can be translated and presented in a visual context, the results of which are biologically meaningful to the protein engineer. Our approach illustrates that we can ground mutations effectively, permitting only the export of validated mutations and their parent sequences as instances in the ontology. To achieve this we use a novel approach that involves iteratively mining protein-mutant pairs instantiated into the ontology by the first-pass text extraction system. This refinement of the ontology instances, based on pattern matching of mutants on sequences, is unprecedented and was shown to increase the number of true positives. Second, we employ an infrequently reported approach in which the results of text mining are instantiated directly into a custom-designed formal ontology. In previous work, Baker et al. 13 instantiated lipid names derived from text mining of the lipid literature into a custom-designed lipid ontology. More relevant is the report by Witte, Kappler, and Baker 24, who also export text mining sentences as instances to the Mutation Miner ontology. Their work highlighted two things: first, the suitability of instantiated ontologies as effective repositories of integrated information, such as word lists, to assist in text mining pipelines; and second, the use of this approach to provide a semantically rich query model for the direct navigation and review of text mining-derived sentences by the protein engineer, who would inspect the mutation impact annotations before deciding to
proceed further with subsequent analysis. While ontologies are readily extensible, at present the Mutation Miner ontology does not support the instantiation of sequence analysis results for reuse in structure visualization. Consequently, we can state that our ontology uniquely addresses the data integration challenges of our application domain by using a custom-designed formal ontology written in the W3C recommended standard, the Web Ontology Language (OWL-DL). This language enables us to use formal semantics to define the metadata exchanged between components of the system, both from the custom text mining tools and from third-party sequence analysis tools. Moreover, OWL supports the interoperability and reuse of our results by the wider bioinformatics community, which uses off-the-shelf APIs and tools such as Jena, OWL-API and Protégé, and highly expressive description-logic-based ontology query engines including FaCT (http://www.cs.man.ac.uk/horrocks/FaCT/), Pellet (http://pellet.owldl.com) and Racer (www.racer-systems.com/) 25, which are in some cases made more accessible with graphical interfaces 26, 27. This position is also clearly stated by other authors 24. Furthermore, our system is able to identify whether the mutated sequences that we have found using a literature-driven approach have a corresponding protein structure. When a sequence has no structure, we employ state-of-the-art third-party homology modeling software to build a model of the sequence for which annotations are present in the ontology. The final novelty is the design of mSTRAPviz as an applet that reads from our OWL-DL standard ontology.
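The grounding idea discussed in this section, in which only mutations that are consistent with their parent sequence are exported, can be illustrated with a minimal Python check. This is a sketch under stated assumptions rather than the iterative mSTRAP procedure: it assumes single-letter mutation mentions of the form wild-type residue, position, mutant residue (for example A4G), assumes 1-based numbering with no offset between the article's numbering and the sequence, and uses the hypothetical function names parse_mutation and is_grounded.

```python
import re
from typing import Optional, Tuple

# The 20 standard amino acids in single-letter code.
_AA = "ACDEFGHIKLMNPQRSTVWY"
_MUTATION = re.compile(rf"^([{_AA}])(\d+)([{_AA}])$")


def parse_mutation(mention: str) -> Optional[Tuple[str, int, str]]:
    """Split a mention such as 'A4G' into (wild type, position, mutant)."""
    match = _MUTATION.match(mention.strip().upper())
    if match is None:
        return None
    return match.group(1), int(match.group(2)), match.group(3)


def is_grounded(mention: str, sequence: str) -> bool:
    """True if the wild-type residue occurs at the stated 1-based position."""
    parsed = parse_mutation(mention)
    if parsed is None:
        return False
    wild_type, position, _mutant = parsed
    if not 1 <= position <= len(sequence):
        return False
    return sequence[position - 1].upper() == wild_type


if __name__ == "__main__":
    # Made-up example sequence, not taken from the paper.
    seq = "MKTAYIAK"
    for mention in ["A4G", "A5G", "K9R"]:
        print(mention, is_grounded(mention, seq))
```

A mention that fails this check would not necessarily be wrong: numbering offsets such as signal peptides are common in the literature, which is one reason a real system needs a more tolerant matching step than this one.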
7. Conclusion

This paper describes the effective combination of multiple semantic technologies: text mining, ontologies, instantiated knowledgebase creation and the distribution of information in accepted interoperability standards, which together serve to revise the predominant model of human-reader, document-centric access to the mutation literature. It contributes a novel workflow and supporting infrastructure with customized end-user functionality that can deliver actionable, in-context mutation summaries to the scientist. Our ongoing work involves collaborations with structural biologists to further assess user requirements and additional text-extractable sources of information suitable for protein structure annotation.

References

1. M. Olivier, R. Eeles, M. Hollstein, M. A. Khan, C. C. Harris and P. Hainaut. The IARC TP53 database: new online mutation analysis and recommendations to users. Hum Mutat 19, 607-614. (2002).
2. S. Bamford, E. Dawson, S. Forbes, J. Clements, R. Pettett, A. Dogan, A. Flanagan, J. Teague, P. A. Futreal, M. R. Stratton and R. Wooster. The COSMIC (Catalogue of Somatic Mutations in Cancer) database and website. Br J Cancer 91, 355-358. (2004).
3. D. Hamroun, S. Kato, C. Ishioka, M. Claustres, C. Beroud and T. Soussi. The UMD TP53 database and website: update and revisions. Hum Mutat 27, 14-20. (2006).
4. B. Niesler, C. Fischer and G. A. Rappold. The human SHOX mutation database. Hum Mutat 20, 338-341. (2002).
5. K. A. Stenberg, P. T. Riikonen and M. Vihinen. KinMutBase, a database of human disease-causing protein kinase mutations. Nucleic Acids Res 27, 362-364. (1999).
6. P. D. Stenson, E. V. Ball, M. Mort, A. D. Phillips, J. A. Shiel, N. S. Thomas, S. Abeysinghe, M. Krawczak and D. N. Cooper. Human Gene Mutation Database (HGMD): 2003 update. Hum Mutat 21, 577-581. (2003).
7. E. G. Tuddenham, R. Schwaab, J. Seehafer, D. S. Millar, J. Gitschier, M. Higuchi, S. Bidichandani, J. M. Connor, L. W. Hoyer, A. Yoshioka et al. Haemophilia A: database of nucleotide substitutions, deletions, insertions and rearrangements of the factor VIII gene, second edition. Nucleic Acids Res 22, 4851-4868. (1994).
8. D. Fredman, M. Siegfried, Y. P. Yuan, P. Bork, H. Lehvaslaiho and A. J. Brookes. HGVbase: a human sequence variation database emphasizing data quality and a broad spectrum of data sources. Nucleic Acids Res 30, 387-391. (2002).
9. T. Kawabata, M. Ota and K. Nishikawa. The Protein Mutant Database. Nucleic Acids Res 27, 355-357. (1999).
10. F. Horn, A. L. Lau and F. E. Cohen. Automated extraction of mutation data from the literature: application of MuteXt to G protein-coupled receptors and nuclear hormone receptors. Bioinformatics 20, 557-568. (2004).
11. D. Rebholz-Schuhmann, S. Marcel, S. Albert, R. Tolle, G. Casari and H. Kirsch. Automatic extraction of mutations from Medline and cross-validation with OMIM. Nucleic Acids Res 32, 135-142. (2004).
12. J. G. Caporaso, W. A. Baumgartner, Jr., D. A. Randolph, K. B. Cohen and L. Hunter. MutationFinder: a high-performance system for extracting point mutation mentions from text. Bioinformatics 23, 1862-1865. (2007).
13. C. J. O. Baker and R. Witte. Mutation Mining - A Prospector's Tale. Information Systems Frontiers 8, 47-57. (2006).
14. R. R. Gabdoulline, S. Ulbrich, S. Richter and R. C. Wade. ProSAT2--Protein Structure Annotation Server. Nucleic Acids Res 34, W79-83. (2006).
15. J. McEntyre and D. Lipman. PubMed: bridging the information gap. Cmaj 164, 1317-1319. (2001).
16. D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell and D. L. Wheeler. GenBank. Nucleic Acids Res 35, D21-25. (2007).
17. C. H. Wu, R. Apweiler, A. Bairoch, D. A. Natale, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang, R. Lopez, M. Magrane, M. J. Martin, R. Mazumder, C. O'Donovan, N. Redaschi and B. Suzek. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 34, D187-191. (2006).
18. H. M. Berman, T. N. Bhat, P. E. Bourne, Z. Feng, G. Gilliland, H. Weissig and J. Westbrook. The Protein Data Bank and the challenge of structural genomics. Nat Struct Biol 7 Suppl, 957-959. (2000).
19. M. K. Smith, C. Welty and D. L. McGuinness. OWL Web Ontology Language Guide. (2004). http://www.w3.org/TR/2003/CR-owl-guide-20030818/
20. S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-3402. (1997).
21. J. M. Chandonia, G. Hon, N. S. Walker, L. Lo Conte, P. Koehl, M. Levitt and S. E. Brenner. The ASTRAL Compendium in 2004. Nucleic Acids Res 32, D189-192. (2004).
22. J. D. Thompson, D. G. Higgins and T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22, 4673-4680. (1994).
23. M. Y. Shen and A. Sali. Statistical potential for assessment and prediction of protein structures. Protein Sci 15, 2507-2524. (2006).
24. R. Witte, T. Kappler and C. J. O. Baker. Enhanced Semantic Access to the Protein Engineering Literature using Ontologies Populated by Text Mining. International Journal of Bioinformatics Research and Applications. (2007).
25. V. Haarslev, R. Möller and M. Wessel. KI-2004 International Workshop on Applications of Description Logics (ADL'04). (2004).
26. C. J. O. Baker, X. Su, G. Butler and V. Haarslev. Ontoligent interactive query tool. Canadian Semantic Web (Semantic Web and Beyond: Computing for Human Experience), Springer-Verlag New York, Inc. (2006).
27. A. Fadhil and V. Haarslev. 2007 International Workshop on Description Logics (DL-2007), Bressanone, near Bozen-Bolzano, Italy. (June 2007).
28. P. Lambrix, H. Tan, V. Jakoniene and L. Strömbäck. Biological Ontologies. In: Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences, C. J. O. Baker and K. H. Cheung (eds). (2007).
29. C. J. O. Baker, A. Shaban-Nejad, X. Su, V. Haarslev and G. Butler. Semantic Web Infrastructure for Fungal Enzyme Biotechnologists. Journal of Web Semantics 4(3), Special Issue on Semantic Web for the Life Sciences. (2006).
30. C. A. Goble, R. Stevens, G. Ng, S. Bechhofer, N. W. Paton, P. G. Baker, M. Peim and A. Brass. Transparent Access to Multiple Bioinformatics Information Sources. IBM Systems Journal 40(2), Special Issue on Deep Computing for the Life Sciences. (2001).
31. K. Wolstencroft, R. Stevens and V. Haarslev. Applying OWL Reasoning to Genomics: A Case Study. In: Semantic Web: Revolutionizing Knowledge Discovery in the Life Sciences, C. J. O. Baker and K. H. Cheung (eds). (2007).
Rajaraman Kanagasabai received his Ph.D. in Computational Learning from the Indian Institute of Science, Bangalore, India. He has since worked on information retrieval, search engines and text mining, and has published several research papers in top peer-reviewed journals and conferences. He was the leader of the team that won the Tan Kah Kee Young Inventors' Award in 2006. Currently he is a Senior Research Fellow in the Department of Data Mining at the Institute for Infocomm Research, and his research interests include text mining, ontologies, the Semantic Web, bioinformatics and Web mining.

Khar Heng Choo received his B.Sc. degree in Computer Science from the National University of Singapore. He is currently working in the Department of Data Mining at the Institute for Infocomm Research, while pursuing his Ph.D. degree in Science (Bioinformatics) at the National University of Singapore.

Shoba Ranganathan received her Ph.D. in Quantum Chemistry from IIT Delhi, India, and continued her post-doctoral work at the Institut de Biologie Physico-Chimique, Paris. She is the Australian Director of the International Society for Computational Biology and Vice-President of the Asia-Pacific Bioinformatics Network. She is also Chairperson of the S* Life Science Informatics Alliance, established to encourage the development of bioinformatics education. She heads her bioinformatics group at the Biotechnology Research Institute at Macquarie University and serves as an adjunct professor at the Yong Loo Lin School of Medicine at the National University of Singapore.

Christopher Baker is a Principal Investigator at the Institute for Infocomm Research, A*STAR, in Singapore. His current focus is on the application of semantic web and text mining technologies in knowledge-based life science information systems. Prior to his current appointment he held the following scientific leadership roles: Bioinformatics Project Manager of the Génome Québec funded project 'Ontologies, the semantic web and intelligent systems for genomics', and Group Leader, in silico Discovery, at Ecopia BioSciences Inc. Dr Baker received post-doctoral training at Iogen Corporation and the University of Toronto after completing his Ph.D. studies in Environmental Microbiology and Enzymology at the University of Wales, Cardiff, UK. He is co-editor of the first book to illustrate deployment of semantic web technologies in the life sciences.