Towards Automatic Pathway Generation from Biological Full-Text Publications

Ekaterina Buyko1, Jörg Linde2, Steffen Priebe2, and Udo Hahn1

1 Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Germany
{ekaterina.buyko,udo.hahn}@uni-jena.de
2 Research Group Systems Biology / Bioinformatics, Leibniz Institute for Natural Product Research and Infection Biology, Hans Knöll Institute (HKI)
{joerg.linde,steffen.priebe}@hki-jena.de

Abstract. We introduce an approach to the automatic generation of biological pathway diagrams from the scientific literature. It is composed of the automatic extraction of single interaction relations, which are typically found in the full text (rather than the abstract) of a scientific publication, and their subsequent integration into a complex pathway diagram. Our focus here is on relation extraction from full-text documents. We compare the performance of automatic full-text extraction procedures with a manually generated gold standard in order to validate the extracted data, which serve as input for the pathway integration procedure.

Keywords: relation extraction, biological text mining, automatic database generation, pathway generation.

1 Introduction

In the life sciences community, a plethora of data is continuously generated by wet-lab biologists, running either in vivo or in vitro experiments, and by bioinformaticians conducting in silico investigations. To cope with such massive volumes of biological data, suitable abstraction layers are needed. Systems biology holds the promise of transforming mass data into abstract models of living systems, e.g., by integrating single interaction patterns between proteins and genes into complex interaction networks and by visualizing them in terms of pathway diagrams (cf., e.g., [9,16]). However, both the determination of single interaction patterns (either from biological databases or biological publications) and their integration into interaction pathways are basically manual procedures which require skilled expert biologists to produce such data abstractions in a time-consuming and labor-intensive process. A typical example of the outcome of this challenging work is the KEGG database.1 Despite all these valuable efforts, Baumgartner et al. [2] have shown that the exponential growth rate of publications already outpaces human capabilities to keep up with the publication of documents relevant for database curation and pathway building.

1 http://www.genome.jp/kegg/pathway.html



As a long-term goal, we aim at automating the process of pathway generation. This goal can be decomposed into several sub-goals:

– Relation Extraction. The basic building blocks of pathways, viz. single interaction relations between proteins and genes, have to be harvested automatically from the scientific literature, full-text documents in particular.

– Full-Text Analytics. Human pathway builders read the full text of scientific publications in order to get a comprehensive and detailed picture of the experiments being reported. In contrast, almost all approaches to automatic relation extraction are based on document surrogates, the accompanying abstracts, only. Running text analytics on full texts rather than on abstracts faces a significant increase in linguistic complexity (e.g., a larger diversity of the types of anaphora). Hence, performance figures differ for abstracts and their associated full texts when such phenomena are accounted for [15,6].

– Relation Integration. Once single interaction relations are made available, they have to be organized in a comprehensive pathway. This requires identifying overlapping arguments, determining causal and temporal orderings among the relations, etc. [12]. That procedure is not just intended to replicate human comprehension and reasoning capabilities but should, beyond that, lead to more complete and even novel pathways that could not otherwise have been generated due to the lack of human resources.

In this paper, our focus will be on the first two problems, while the third one is left for future work. Indeed, the success of text mining tools for automatic database generation, i.e., automatic relation extraction procedures complementing the work of human database curators, has already been demonstrated for RegulonDB,2 the world's largest manually curated reference database for the transcriptional regulation network of E. coli (e.g., [14,8,6]). We here focus on the reconstruction of a regulatory network for the human pathogenic microorganism C. albicans, which normally lives as a harmless commensal yeast within the body of healthy humans [13]. However, this fungus can change its benevolent behaviour and cause opportunistic superficial infections of the oral or genital epithelia. Given this pathogenic behaviour, knowledge about the underlying regulatory interactions might help to understand the onset and progression of infections. Despite their importance, the number of known regulatory interactions in C. albicans is still rather small. The (manually built) Transfac database3 collects regulatory interactions and transcription factor binding sites (TFBS) for a number of organisms. However, it includes information about only five transcription factors of C. albicans. The Candida Genome Database4 includes the most up-to-date manually curated gene annotations. Although regulatory interactions might be mentioned, they are not the main focus of the database and often rely on micro-array experiments only.

2 http://regulondb.ccg.unam.mx/
3 http://www.biobase-international.com/index.php?id=transfac
4 http://www.candidagenome.org/


Currently, the HKI Institute in Jena is manually collecting transcription factor-target gene interactions for C. albicans. The workflow requires carefully reading research papers and critically interpreting the experimental techniques and results reported therein. Yet, the most time-consuming step is paper reading and understanding the interactions of a transcription factor or its target genes. In fact, it took this group more than two years to identify information about the 79 transcription factors (TFs) for C. albicans. To speed up this process, automatic text-mining algorithms were plugged in to identify regulatory interactions more rapidly and at a higher rate. Yet, such quantitative criteria (processing speed, number of identified relations, etc.) have to be carefully balanced against qualitative ones (reliability and validity of the identified relations).

2 Related Work

Although relation extraction (RE) has become a common theme of natural language processing activities in the life sciences domain, it is mostly concerned with general, coarse-grained protein-protein interactions (PPIs). There are only a few studies which deal with more fine-grained (and often also more complex) topics such as gene regulation. The Genic Interaction Extraction Challenge [11] marks an early attempt to determine the state-of-the-art performance of systems designed for the detection of gene regulation interactions. The best system achieved a performance of about 50% F-score. The results, however, have to be taken with care as the LLL corpus used in the challenge is severely limited in size. Šarić et al. [17] extracted gene regulatory networks and achieved an accuracy of up to 90% in the RE task. They disregarded, however, ambiguous instances, which may have caused recall to plunge to a mere 20%.
Rodríguez-Penagos et al. [14] used Šarić et al.'s approach for the first large-scale automatic reconstruction of RegulonDB. The best results scored at 45% recall and 77% precision. Still, this system was specifically tuned for the extraction of transcriptional regulation in E. coli. Buyko and Hahn [6], using a generic RE system [5], achieved 38% recall and 33% precision (F-score: 35%) under real-life conditions. Hahn et al. [8] also pursued the idea of reconstructing curated databases from a more methodological perspective, and compared general rule-based and machine learning (ML)-based system performance for the extraction of regulatory events. Given the same experimental settings, the ML-based system slightly outperformed the rule-based one, with the additional advantage that the ML approach is intrinsically more general and thus scalable.

3 Gene Regulation Events as an Information Extraction Task

3.1 Information Extraction as Relation Extraction

To illustrate the notion and inherent complexity of biological events, consider the following example: "p53 expression mediated by coactivators TAFII40 and TAFII60." This sentence contains the mention of two protein-protein interactions (PPIs), namely PPI(p53, TAFII40) and PPI(p53, TAFII60).


TAFII60.” contains the mention of two protein-protein interactions (PPIs), namely PPI(p53, TAFII40), and PPI(p53, TAFII60). More specifically, TAFII40 and TAFII60 regulate the transcription of p53. Hence, the interactions mentioned above can be described as Regulation of Gene Expression which is much more informative than just talking about PPI. The coarse-grained PPI event can thus further be decomposed into molecular subevents (in our example, Regulation and Gene Expression) or even cascades of events (formally expressed by nested relations). The increased level of detail is required for deep biological reasoning and comes already close to the requirements underlying pathway construction. Algorithmically speaking, event extraction is a complex task that can be subdivided into a number of subtasks depending on whether the focus is on the event itself or on the arguments involved: Event trigger identification deals with the large variety of alternative verbalizations of the same event type, e.g., whether the event is expressed in a verbal or in a nominalized form, in active or passive voice (for instance, “A is expressed ” vs. “the expression of A” both refer to the same event type, viz. Expression(A)). Since the same trigger may stand for more than one event type, event trigger ambiguity has to be resolved as well. Named Entity Recognition is a step typically preceding event extraction. Basically, it aims at the detection of all possible event arguments, usually, within a single sentence. Argument identification is concerned with selecting from all possible event arguments (the set of sentence-wise identified named entities) those that are the true participants in an event, i.e., the arguments of the relation. Argument ordering assigns each true participant its proper functional role within the event, mostly the acting Agent (and the acted-upon Patient/Theme). In the following, we are especially interested in Regulation of Gene Expression (so-called ROGE) events (which are always used synonymous with regulatory relations). For our example above, we first need to identify all possible gene/protein names (p53, TAFII40, TAFII60 ), as well as those event triggers that stand for Gene Expression and Regulation subevents (“expression” and “mediate”, respectively). Using this information, we extract relationships between genes/proteins. Given our example, the outcome are two ROGE events — p53 is assigned the Patient role, while TAFII40 and TAFII60 are assigned the Agent roles. In the next section, we will focus on ROGE events only. 3.2

3.2 Regulation of Gene Expression Events

The regulation of gene expression can be described as the process that modulates the frequency, rate, or extent of gene expression, where gene expression is the process in which the coding sequence of a gene is converted into a mature gene product or products, namely proteins or RNA (taken from the definition of the Gene Ontology class Regulation of Gene Expression, GO:0010468).5


Transcription factors, cofactors and regulators are proteins that play a central role in the regulation of gene expression. The analysis of gene regulatory processes is ongoing research work in molecular biology and affects a large number of research domains. In particular, the interpretation of gene expression profiles from micro-array analyses could benefit from exploiting our knowledge of ROGE events as described in the literature. Accordingly, we selected two document collections as training corpora for the automatic extraction of ROGE events.

GeneReg Corpus. The GeneReg corpus consists of more than three hundred PubMed6 abstracts dealing with the regulation of gene expression in the model organism E. coli. GeneReg provides three types of semantic annotations: named entities involved in gene regulatory processes, e.g., transcription factors and genes; events involving regulators and regulated genes; and event triggers. Domain experts annotated ROGE events, together with genes and regulators affecting the expression of the genes. This annotation was based on the Gro ontology [3] class ROGE, with its two subclasses Positive ROGE and Negative ROGE. An event instance contains two arguments, viz. Agent, the entity that modifies gene expression, and Patient, the entity whose expression is modified. Transcription factors (in core ROGE), or polymerases and chemicals (in auxiliary ROGE), can act as Agents.7 The sentence "H-NS and StpA proteins stimulate expression of the maltose regulon in Escherichia coli." contains two Positive ROGE instances, with H-NS and StpA as regulators and the maltose regulon as the regulated gene group. GeneReg comes with 1,200 core events, plus 600 auxiliary ROGE events.

BioNLP Shared Task Corpus. The BioNLP-ST corpus contains a sample of 950 Medline abstracts. The set of molecular event types being covered includes, among others, Binding, Gene Expression, Transcription, and (positive, negative, unspecified) Regulation. Buyko et al. [4] showed that the regulation of gene expression can be expressed by means of BioNLP-ST Binding events. For example, the sentence "ModE binds to the napF promoter." describes a Binding event between ModE and napF. As ModE is a transcription factor, this Binding event stands for a description of the regulation of gene expression and thus has to be interpreted as a ROGE event. We selected for our experiments all mentions of Binding events from the BioNLP-ST corpus.
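To make the reinterpretation of Binding events concrete, the following minimal sketch derives a ROGE event from a Binding event whenever one of its arguments is a known transcription factor. The function name, the tuple format, and the TF list are our own illustrative assumptions, not part of the corpus tooling.

```python
def binding_to_roge(binding_args, known_tfs):
    """Reinterpret a BioNLP-ST Binding event as a ROGE event if one of its
    arguments is a known transcription factor; the TF acts as Agent, the
    bound gene/promoter as Patient. Returns (agent, patient) or None."""
    tfs = [a for a in binding_args if a in known_tfs]
    others = [a for a in binding_args if a not in known_tfs]
    if tfs and others:
        return (tfs[0], others[0])
    return None

# Example from the corpus description: "ModE binds to the napF promoter."
print(binding_to_roge(["ModE", "napF"], known_tfs={"ModE"}))  # ('ModE', 'napF')
```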

4 The JReX Relation Extraction System

The following experiments were run with the event extraction system JReX (Jena Relation eXtractor). Generally speaking, JReX classifies pairs of genes in sentences as interaction pairs using various syntactic and semantic decision criteria (cf. Buyko et al. [5] for a more detailed description).

5 http://www.geneontology.org/
6 http://www.pubmed.org
7 Auxiliary regulatory relations (606 instances) extend the original GeneReg corpus.


The event extraction pipeline of JReX consists of two major parts, the pre-processor and the dedicated event extractor. The JReX pre-processor uses a series of text analytics tools to annotate and thus assign structure to the plain text stream using linguistic meta data (see Subsection 4.1 for more details). The JReX event extractor incorporates manually curated dictionaries and ML technology to sort out associated event triggers and arguments on this structured input (see Subsection 4.2 for further details). Given this methodological framework, JReX (entered by the JulieLab team) ranked 2nd among 24 competing teams in the BioNLP'09 Shared Task on Event Extraction,8 with 45.8% precision, 47.5% recall and 46.7% F-score. After the competition, the system was further streamlined and achieved 57.6% precision, 45.7% recall and 51.0% F-score [5], thus considerably narrowing the gap to the winner of the BioNLP'09 Shared Task (Turku team, 51.95% F-score).

JReX exploits two fundamental sources of knowledge to identify ROGE events. First, the syntactic structure of each sentence is taken into account. In the past years, dependency grammars and associated parsers have become the dominant approach in information extraction for representing the syntactic structure of sentences in terms of dependency trees (cf. Figure 1 for an example). Basically, a dependency parse tree consists of (lexicalized) nodes, which are labelled by the lexical items a sentence is composed of, and edges linking pairs of lexicalized nodes, where the edges are labelled by dependency relations (such as SUBject-of, OBJect-of, etc.). These relations form a limited set of roughly 60 types and express fundamental syntactic relations between a lexical head and a lexical modifier. Once a dependency parse has been generated for a sentence, the tree first undergoes a syntactic simplification process in which lexical material that is semantically irrelevant (from the perspective of information extraction) is deleted from the tree. This leads to so-called "trimmed" dependency trees, which are much more compact than the original dependency parse trees. For example, JReX prunes auxiliary and modal verbs which govern the main verb in syntactic structures such as passives, past or future tense. Accordingly (cf. Figure 1), the verb "activate" is promoted to the Root of the dependency graph and governs all nodes that were originally governed by the modal "may".

Second, conceptual taxonomic knowledge of the biology domain is taken into consideration for semantic enrichment. Syntactically trimmed dependency trees still contain lexical items as node labels. These nodes are screened for their relevance to ROGE events and, where relevant, their lexical labels are replaced by conceptual labels at increasing levels of semantic generality. For instance (cf. Figure 1), the lexical item "TNF-alpha" is turned into the conceptual label Gene. This abstraction avoids over-fitting of dependency structures for the machine learning mechanisms on which JReX is based.

8 http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/


Fig. 1. Trimming of dependency graphs
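The following toy example sketches the two trimming steps in Python. It is a simplification under our own assumptions about the parse representation, the modal-verb list, and the toy sentence; it is not the JReX code itself. A modal verb governing the main verb is pruned, the main verb is promoted to the root, and gene mentions are overlaid with the concept label Gene.

```python
# Toy dependency parse for the invented sentence "TNF-alpha may activate IL-6";
# each node: surface word, head node id (0 = root), dependency relation.
parse = {
    1: {"word": "TNF-alpha", "head": 3, "rel": "SUB"},
    2: {"word": "may",       "head": 0, "rel": "ROOT"},
    3: {"word": "activate",  "head": 2, "rel": "VC"},
    4: {"word": "IL-6",      "head": 3, "rel": "OBJ"},
}

MODALS = {"may", "might", "can", "could", "should", "must"}   # assumed list
GENES = {"TNF-alpha", "IL-6"}   # would come from named entity recognition

def trim(parse):
    """Syntactic simplification plus semantic enrichment: prune modal verbs,
    promote the governed main verb to the root, re-attach any remaining
    dependents, and replace gene mentions by the concept label 'Gene'."""
    trimmed = {i: dict(n) for i, n in parse.items()}
    for i, node in list(trimmed.items()):
        if node["word"].lower() in MODALS:
            main = next((j for j, m in trimmed.items()
                         if m["head"] == i and j != i), None)
            if main is None:
                continue
            # the main verb inherits the modal's attachment (here: the root)
            trimmed[main]["head"], trimmed[main]["rel"] = node["head"], node["rel"]
            for m in trimmed.values():
                if m["head"] == i:
                    m["head"] = main
            del trimmed[i]
    for node in trimmed.values():
        if node["word"] in GENES:
            node["word"] = "Gene"   # concept overlay
    return trimmed

print(trim(parse))
# 'activate' becomes the root; both gene mentions are replaced by 'Gene'.
```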

4.1 JReX Pre-processor

As far as pre-processing is concerned, JReX uses the JCoRe tool suite [7], e.g., the JulieLab sentence splitter and tokenizer. For shallow syntactic analysis, we apply the OpenNLP POS Tagger and Chunker,9 both re-trained on the Genia corpus.10 For dependency parsing, the MST parser [10] was retrained on the Genia Treebank, and the parses were subsequently converted to the CoNLL'07 format.11 The data is further processed for named entity recognition and normalization with the gene tagger GeNo [18] and a number of regex- and dictionary-based entity taggers (covering promoters, binding sites, and transcription factors).

4.2 JReX Event Extractor

The JReX event extractor accounts for three major subtasks: first, the detection of lexicalized event triggers; second, the trimming of dependency graphs, which involves eliminating informationally irrelevant and enriching informationally relevant lexical material; and, third, the identification and ordering of arguments for the event under scrutiny.

Event Trigger Detection & Trimming of Dependency Graphs. Event triggers cover the large variety of lexicalizations of an event. Unfortunately, these lexical signals often lack discriminative power relative to individual event types and thus carry an inherent potential for ambiguity.12 For event trigger detection, the JReX system exploits various trigger terminologies gathered from selected corpora (see Subsection 3.2), extended and curated by biology students. Event trigger detection itself is performed with the Lingpipe Dictionary Chunker.13

9 http://incubator.apache.org/opennlp/
10 http://www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/
11 We used Genia Treebank version 1.0, available from http://www-tsujii.is.s.u-tokyo.ac.jp. The conversion script is accessible via http://nlp.cs.lth.se/pennconverter/
12 Most of the triggers are neither specific for molecular event mentions in general, nor for a particular event type. "Induction", e.g., occurs 417 times in the BioNLP-ST training corpus as a putative trigger. In 162 of these cases it acts as a trigger for the event type Positive Regulation, 6 times as a trigger for Transcription, 8 instances trigger Gene Expression, while 241 occurrences do not trigger an event at all.


In the next step, JReX performs the deletion and re-arrangement of syntactic dependency graph nodes by trimming the dependency graphs. This process amounts to eliminating informationally irrelevant lexical nodes and to enriching informationally relevant ones with concept overlays, as depicted in Figure 1 (for a detailed account and an evaluation of these simplification procedures, cf. [5]).

Argument Identification and Ordering. The basic event extraction steps are argument detection and ordering. For argument extraction, JReX builds pairs of triggers and their putative arguments (named entities), or trigger-less pairs of named entities only. For this classification task, JReX uses two classifiers:

Feature-based Classifier. Three groups of features are distinguished: (1) lexical features (covering lexical items before, after, and between the mentions of the event trigger and an associated argument); (2) chunking features (concerned with the head words of the phrases between the two mentions); (3) dependency parse features (considering both the selected dependency levels of the arguments, their parents, and the least common subsumer, as well as the shortest dependency path structure between the arguments for walk features). For this feature-based approach, the Maximum Entropy (MaxEnt) classifier from Mallet is used.14

Graph Kernel Classifier. The graph kernel uses a converted form of the dependency graph in which each dependency node is represented by a set of labels associated with that node. The dependency edges are also represented as nodes in the new graph, connected to the nodes that were adjacent in the dependency graph. Subgraphs which represent, e.g., the linear order of the words in the sentence can be added if required. The entire graph is represented in terms of an adjacency matrix which is further processed to obtain the summed weights of the paths connecting two nodes of the graph (cf. Airola et al. [1] for details). For the graph kernel approach, the LibSVM Support Vector Machine implementation is used as the classifier.15
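As an illustration of the all-paths idea behind the graph kernel and of the ensemble configuration used later in the experiments, consider the following sketch. It is our own simplification; the actual kernel of Airola et al. [1] operates on considerably richer labelled graphs.

```python
import numpy as np

def summed_path_weights(W):
    """For a weighted adjacency matrix W (edge weights chosen so that the
    spectral radius stays below 1), the Neumann series
        W + W^2 + W^3 + ... = (I - W)^{-1} - I
    yields, in entry (i, j), the summed weight of all paths from node i to
    node j -- the quantity the graph kernel compares between candidate
    trigger-argument graphs."""
    I = np.eye(W.shape[0])
    return np.linalg.inv(I - W) - I

def ensemble_union(svm_positive, maxent_positive):
    """Union of positive decisions: a candidate pair counts as an event
    argument if either the graph-kernel SVM or the MaxEnt classifier
    labels it positive (the ensemble configuration used in Section 5)."""
    return [s or m for s, m in zip(svm_positive, maxent_positive)]

# A 3-node path graph with uniform edge weight 0.5.
W = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.0]])
print(summed_path_weights(W))
```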

5 Experiments

5.1 Evaluation Scenario and Experimental Settings

C. albicans Interactions as Gold Standard. In the following, the C. albicans Regulatory Network (RN) refers to the collection of regulatory interactions curated by the HKI in Jena. C. albicans RN includes the following information for each regulation event: the regulatory gene (the Agent in such an event, a transcription factor (TF)) and the regulated gene (the Patient). Evaluation against C. albicans RN constitutes a challenging real-life scenario.

13 http://alias-i.com/lingpipe/
14 http://mallet.cs.umass.edu/index.php/Main_Page
15 http://www.csie.ntu.edu.tw/~cjlin/libsvm


We use two gold standard sets for our evaluation study, viz. C. albicans RN and C. albicans RN-small. C. albicans RN contains 114 interactions for 31 TFs. As this set does not contain any references from interactions to the full texts they were extracted from, there is no guarantee that the reference full texts are indeed contained in the document sets we used for this evaluation study. Therefore, we built the C. albicans RN-small set, a subset of C. albicans RN which is complemented with references to full texts (8 documents) and contains 40 interactions for 7 TFs. C. albicans RN-small thus serves as a gold standard set directly linked to the literature, allowing a more proper evaluation scenario for JReX.

Processing and Evaluation Settings. For our experiments, we re-trained the JReX argument extraction component on the corpora presented in Section 3.2, i.e., on the GeneReg corpus and on the Binding event annotations from the BioNLP-ST corpus. As Binding events do not represent directed relations, we stipulate here that the protein occurring first is assigned the Agent role.16 For argument detection, we used the graph kernel and MaxEnt models in an ensemble configuration, i.e., the union of positive instances was computed.17

To evaluate JReX against C. albicans RN, we first processed various sets of input documents (see below), collected all unique gene regulation events extracted this way, and compared this set of events against the full set of known events in C. albicans RN. A true positive (TP) hit is obtained when an event found automatically corresponds to one in C. albicans RN, i.e., has the same Agent and Patient. The type of regulation is not considered. A false positive (FP) hit is counted if an event was found which does not occur in the same way in C. albicans RN, i.e., either the Patient or the Agent (or both) are wrong. False negatives (FN) are events covered by C. albicans RN but not found automatically by the system. By default, all events extracted by the system are considered in "TF-filtered" mode, i.e., only events with an Agent from the list of all known TFs for C. albicans from C. albicans RN are considered. From these hits we calculated standard precision, recall, and F-score values. Of course, system performance largely depends on the size of the base corpus collection being processed. Thus, we obtained separate performance scores for all three document sets.

Input Document Sets. Various C. albicans research paper sets were prepared for the evaluation against the C. albicans gold standards. We ran a PubMed search for "Candida albicans" and downloaded 17,750 Medline abstracts and 6,000 freely available papers, in addition to approximately 1,000 non-free full-text articles collected at the HKI. An overview of the data sets involved is given in Table 1. The CA set contains approximately 17,750 Medline abstracts, the CF set holds more than 7,000 documents, and the CF-small set is composed of 8 full texts from the CF set that are used as references for C. albicans interactions from C. albicans RN-small.

16 In particular, transcription factors that bind to regulated genes are usually mentioned before the regulated genes.
17 That is, both an SVM employing a graph kernel and a MaxEnt model were used to predict on the data. The two prediction results were merged by considering as positive every example that had been classified as positive by either the SVM or the MaxEnt model.
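To make the evaluation protocol described above concrete, here is a minimal sketch. It assumes that interactions are represented as (agent, patient) string pairs, which is our own simplification.

```python
def evaluate(extracted, gold, known_tfs):
    """Precision/recall/F-score of extracted (agent, patient) pairs against
    the gold standard, in "TF-filtered" mode: only events whose Agent is a
    known transcription factor are considered; the type of regulation
    (positive/negative) is ignored."""
    extracted = {(a, p) for (a, p) in extracted if a in known_tfs}
    tp = len(extracted & gold)
    fp = len(extracted - gold)
    fn = len(gold - extracted)
    precision = tp / (tp + fp) if extracted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# Toy example with made-up identifiers:
print(evaluate(extracted={("TF1", "geneA"), ("TF1", "geneB")},
               gold={("TF1", "geneA"), ("TF1", "geneC")},
               known_tfs={"TF1"}))   # (0.5, 0.5, 0.5)
```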

Table 1. Document sets collected for the evaluation study

Document Set                              Number of Documents
CA - C. albicans abstracts                17,746
CF - C. albicans full texts                7,024
CF-small - C. albicans RN-small texts          8

5.2 Experimental Results

We ran JReX with the re-trained argument extractor on all document sets introduced above. As a baseline, we used simple sentence-wise co-occurrence of putative event arguments and event triggers, i.e., if two gene name mentions and at least one event trigger appear in a sentence, that pair of genes is considered to be part of a regulatory relation. As the regulatory relation (ROGE event) is a directed relation, we built two regulatory relations with interchanged Agent and Patient. The results of the baseline and JReX runs are presented in Table 2 for the C. albicans RN set and in Table 3 for the C. albicans RN-small set.

Table 2. Event extraction evaluated on the full data set C. albicans RN for known transcription factors in C. albicans. Recall/Precision/F-score values are given for each document set.

Text Mining       Full Texts        Abstracts         All Sets
                  R/P/F             R/P/F             R/P/F
Co-occurrence     0.84/0.01/0.03    0.27/0.09/0.14    0.84/0.01/0.03
JReX              0.35/0.18/0.24    0.13/0.29/0.18    0.35/0.18/0.23

Table 3. Event extraction evaluated on the referenced data set C. albicans RN-small for known transcription factors in C. albicans. Recall/Precision/F-score values are given for each document set.

Text Mining       Full Texts        Abstracts         C. albicans RN-small texts   All Sets
                  R/P/F             R/P/F             R/P/F                        R/P/F
Co-occurrence     0.90/0.02/0.03    0.31/0.13/0.18    0.72/0.14/0.24               0.90/0.02/0.03
JReX              0.64/0.30/0.41    0.13/0.28/0.18    0.54/0.51/0.52               0.64/0.29/0.40

Using the baseline, the best recall could be achieved on full texts, with 0.84 points on the C. albicans RN set (see Table 2) and 0.90 points on the C. albicans RN-small set (see Table 3). For the abstract sets we achieved only low recall values of 0.27 and 0.31 points on the two sets, respectively. This outcome confirms the reasonable assumption that full-text articles contain considerably more mentions of interaction events than their associated abstracts. Still, the precision of the baseline on full-text articles is miserable (0.01 on the C. albicans RN set and 0.02 on the C. albicans RN-small set). These data indicate that a more sophisticated approach to event extraction, such as the one underlying the JReX system, is much needed.

On the C. albicans RN set, JReX achieved 0.35 recall and 0.18 precision points on full texts (see Table 2). Given the lack of literature references in the C. albicans RN set, we cannot guarantee that our document collection contains all relevant documents. Therefore, we used the C. albicans RN-small set for a proper evaluation of JReX. The best JReX-based results were achieved on full texts, analyzing the 8 officially referenced full-text articles: JReX achieves here 0.52 points F-score, with 0.54 points recall and 0.51 points precision (see Table 3). This level of performance matches fairly well with the results JReX obtained in the BioNLP 2009 Shared Task. When we use all the data sets for the evaluation on the C. albicans RN-small set, our recall peaks at 0.64 points, with a reasonable precision of 0.29 and an F-score of 0.40 (see Table 3).

As the recall figures of our baseline reveal that abstracts contain far fewer ROGE events than full texts, we have strong evidence that the biomedical NLP community needs to process full-text articles to effectively support database curation. As full-text documents are linguistically more complex and thus harder to process, the relative amount of errors is higher than on abstracts. A manual analysis of false negatives revealed that we miss, in particular, relations that are described across sentences using coreferential expressions. As JReX extracts relations between entities only within the same sentence, the next step should be to incorporate cross-sentence mentions of entities as well.
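For reference, the sentence-wise co-occurrence baseline described at the beginning of this section can be sketched as follows. The input format with pre-annotated gene mentions and triggers is an assumption of this illustration, not the format actually used in our pipeline.

```python
from itertools import combinations

def cooccurrence_baseline(sentences):
    """If two gene mentions and at least one event trigger occur in the same
    sentence, emit that gene pair as a regulatory relation; since ROGE events
    are directed, both (Agent, Patient) orderings are produced."""
    relations = set()
    for sent in sentences:
        if not sent["triggers"]:
            continue
        for g1, g2 in combinations(sorted(set(sent["genes"])), 2):
            relations.add((g1, g2))
            relations.add((g2, g1))
    return relations

# Toy input with made-up identifiers:
sentences = [{"genes": ["geneA", "geneB"], "triggers": ["activates"]},
             {"genes": ["geneC", "geneD"], "triggers": []}]
print(cooccurrence_baseline(sentences))
# {('geneA', 'geneB'), ('geneB', 'geneA')}
```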

6 Conclusion

Pathway diagrams for the visualization of complex biological interaction processes between genes and proteins are an extremely valuable generalization of vast amounts of experimental biological data. Yet, these diagrams are usually built by highly skilled humans. As an alternative, we here propose to automate the entire process of pathway generation. The focus of this work, however, is on extracting relations, the building blocks of pathways, from the original full-text publications rather than from document surrogates such as abstracts, the common approach. Our evaluation experiments reveal that full texts indeed have to be screened rather than the informationally poorer abstracts. This comes at the price of dealing with more complex linguistic material: full texts exhibit linguistic structures (e.g., anaphora) typically not found (at that rate) in abstracts. After the integration of anaphora resolution tools, the next challenge will consist of integrating all harvested relations into a single, coherent pathway, thus reflecting causal, temporal, and functional ordering constraints among the single relations [12].

Acknowledgments. This work is partially funded by a grant from the German Ministry of Education and Research (BMBF) for the Jena Centre of Systems Biology of Ageing (JenAge) (grant no. 0315581D).


References

1. Airola, A., Pyysalo, S., Björne, J., Pahikkala, T., Ginter, F., Salakoski, T.: A graph kernel for protein-protein interaction extraction. In: BioNLP 2008 – Proceedings of the ACL/HLT 2008 Workshop on Current Trends in Biomedical Natural Language Processing, Columbus, OH, USA, June 19, pp. 1–9 (2008)
2. Baumgartner Jr., W.A., Cohen, K.B., Fox, L.M., Acquaah-Mensah, G., Hunter, L.: Manual curation is not sufficient for annotation of genomic databases. Bioinformatics (ISMB/ECCB 2007 Supplement) 23(13), i41–i48 (2007)
3. Beisswanger, E., Lee, V., Kim, J.J., Rebholz-Schuhmann, D., Splendiani, A., Dameron, O., Schulz, S., Hahn, U.: Gene Regulation Ontology (GRO): Design principles and use cases. In: MIE 2008 – Proceedings of the 20th International Congress of the European Federation for Medical Informatics, Göteborg, Sweden, May 26-28, pp. 9–14 (2008)
4. Buyko, E., Beisswanger, E., Hahn, U.: The GeneReg corpus for gene expression regulation events: An overview of the corpus and its in-domain and out-of-domain interoperability. In: LREC 2010 – Proceedings of the 7th International Conference on Language Resources and Evaluation, La Valletta, Malta, May 19-21, pp. 2662–2666 (2010)
5. Buyko, E., Faessler, E., Wermter, J., Hahn, U.: Syntactic simplification and semantic enrichment: Trimming dependency graphs for event extraction. Computational Intelligence 27(4) (2011)
6. Buyko, E., Hahn, U.: Generating semantics for the life sciences via text analytics. In: ICSC 2011 – Proceedings of the 5th IEEE International Conference on Semantic Computing, Stanford University, CA, USA, September 19-21 (2011)
7. Hahn, U., Buyko, E., Landefeld, R., Mühlhausen, M., Poprat, M., Tomanek, K., Wermter, J.: An overview of JCoRe, the JULIE Lab UIMA component repository. In: Proceedings of the LREC 2008 Workshop 'Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP', Marrakech, Morocco, May 31, pp. 1–7 (2008)
8. Hahn, U., Tomanek, K., Buyko, E., Kim, J.J., Rebholz-Schuhmann, D.: How feasible and robust is the automatic extraction of gene regulation events? A cross-method evaluation under lab and real-life conditions. In: BioNLP 2009 – Proceedings of the NAACL/HLT BioNLP 2009 Workshop, Boulder, CO, USA, June 4-5, pp. 37–45 (2009)
9. Luciano, J.S., Stevens, R.D.: e-Science and biological pathway semantics. BMC Bioinformatics 8(Suppl. 3), S3 (2007)
10. McDonald, R.T., Pereira, F., Kulick, S., Winters, R.S., Jin, Y., White, P.: Simple algorithms for complex relation extraction with applications to biomedical IE. In: ACL 2005 – Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, MI, USA, June 25-30, pp. 491–498 (2005)
11. Nédellec, C.: Learning Language in Logic: Genic interaction extraction challenge. In: Proceedings of LLL 2005 – 4th Learning Language in Logic Workshop, Bonn, Germany, August 7, pp. 31–37 (2005)
12. Oda, K., Kim, J.D., Ohta, T., Okanohara, D., Matsuzaki, T., Tateisi, Y., Tsujii, J.: New challenges for text mining: Mapping between text and manually curated pathways. BMC Bioinformatics 9(Suppl. 3), S5 (2008)
13. Odds, F.C.: Candida and Candidosis, 2nd edn. Baillière Tindall, London (1988)


14. Rodríguez-Penagos, C., Salgado, H., Martínez-Flores, I., Collado-Vides, J.: Automatic reconstruction of a bacterial regulatory network using natural language processing. BMC Bioinformatics 8, 293 (2007)
15. Sanchez, O., Poesio, M., Kabadjov, M.A., Tesar, R.: What kind of problems do protein interactions raise for anaphora resolution? A preliminary analysis. In: SMBM 2006 – Proceedings of the 2nd International Symposium on Semantic Mining in Biomedicine, Jena, Germany, April 9-12, pp. 109–112 (2006)
16. Viswanathan, G.A., Seto, J., Patil, S., Nudelman, G., Sealfon, S.C.: Getting started in biological pathway construction and analysis. PLoS Computational Biology 4(2), e16 (2008)
17. Šarić, J., Jensen, L.J., Ouzounova, R., Rojas, I., Bork, P.: Extracting regulatory gene expression networks from PubMed. In: ACL 2004 – Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, July 21-26, pp. 191–198 (2004)
18. Wermter, J., Tomanek, K., Hahn, U.: High-performance gene name normalization with GeNo. Bioinformatics 25(6), 815–821 (2009)
