Briefings in Bioinformatics Advance Access published December 18, 2012
Briefings in Bioinformatics, page 1 of 14
doi:10.1093/bib/bbs084
A survey on annotation tools for the biomedical literature
Mariana Neves and Ulf Leser
Submitted: 7th September 2012; Received (in revised form): 5th November 2012

Abstract
New approaches to biomedical text mining crucially depend on the existence of comprehensive annotated corpora. Such corpora, commonly called gold standards, are important for learning patterns or models during the training phase, for evaluating and comparing the performance of algorithms and also for better understanding the information sought for by means of examples. Gold standards depend on human understanding and manual annotation of natural language text. This process is very time-consuming and expensive because it requires high intellectual effort from domain experts. Accordingly, the lack of gold standards is considered one of the main bottlenecks for developing novel text mining methods. This situation led to the development of tools that support humans in annotating texts. Such tools should be intuitive to use, should support a range of different input formats, should include visualization of annotated texts and should generate an easy-to-parse output format. Today, a range of tools implementing some of these functionalities are available. In this article, we present a comprehensive survey of tools for supporting the annotation of biomedical texts. Altogether, we considered almost 30 tools, 13 of which were selected for an in-depth comparison. The comparison was performed using predefined criteria and was accompanied by hands-on experiences whenever possible. Our survey shows that current tools can support many of the tasks in biomedical text annotation in a satisfying manner, but also that no tool can be considered a true comprehensive solution.

Keywords: annotation tools; curation tools; gold standard corpora; text mining

Corresponding author. Mariana Neves, Department of Computer Science, Humboldt-Universität zu Berlin, Rudower Chaussee 25, 12489 Berlin, Germany. Tel: +49-30-2093-5486; Fax: +49-30-2093-3045; E-mail: [email protected]

Mariana Neves has a PhD in computer science from the Universidad Complutense de Madrid, Spain. She is currently working as a post-doc researcher in biomedical text mining in the group of Prof. Leser, Humboldt-Universität zu Berlin, Germany.

Ulf Leser holds the Chair of Knowledge Management in Bioinformatics at the Department for Computer Science, Humboldt-Universität zu Berlin, Germany.

© The Author 2012. Published by Oxford University Press. For Permissions, please email: [email protected]

Downloaded from http://bib.oxfordjournals.org/ at Humboldt-Universitaet zu Berlin on January 1, 2013

INTRODUCTION
In text mining, the development of new applications crucially depends on the existence of pre-annotated texts, called gold standard corpora (GSC). GSC are important for evaluating the performance of tools, for learning patterns or models in supervised information extraction and for an unbiased comparison of tools [1, 2]. This is particularly true in knowledge-rich domains such as the life sciences, where texts deal with millions of entities whose naming often does not follow any regularities or conventions [3] and where relationships between entities may differ in highly subtle ways [4]. These properties imply that approaches purely built on background knowledge, such as dictionaries for entity names or language patterns specific for the intended relationships, often lead to insufficient recall [5].

A GSC is a set of documents where all mentions of facts of interest have been manually annotated by a human expert. GSC essentially provide examples for algorithms which learn commonalities between the annotated facts and which then try to generalize these commonalities into rules also finding novel facts [6]. This approach has proven to be highly successful in many biomedical text mining tasks [7–9], leading to an increasing interest in biomedical GSC. Examples at the entity level are corpora for detecting gene names [10], species names [11] or names of chemical entities [12]; at the relationship level, examples are corpora for studying protein–protein interactions (PPIs) [13], gene regulatory relationships [14], complex events [15] or drug–drug interactions [16]. A comprehensive collection of corpora for studying relationships between biomedical entities can be found at http://corpora.informatik.hu-berlin.de.

The manual annotation of text documents is a time-consuming and error-prone task, as it requires a high intellectual effort from domain experts, such as biologists or pharmacists [17]. Annotation of comprehensive corpora frequently involves a team of annotators, often with slightly different opinions about the facts under study, leading to inconsistent annotations. Comprehensive annotation guidelines are therefore important, i.e. documents which convey a precise definition of the facts to be annotated. The development of such guidelines is also a complex undertaking that often proceeds in an incremental fashion, which in turn makes re-annotation of texts necessary [17]. This situation makes the creation of large GSC a project that often turns out to be much more expensive and much more involved than expected at the start.

However, the creation of GSC is a task that can be well supported by computational tools. Supportive tasks include graphical interfaces for highlighting and tagging texts, management of text collections, assignment of texts to annotators, definition of annotation schemes (classes of entities and relationships that may be annotated) or export of the result into various formats. If available, existing text mining tools can be integrated to provide pre-tagged input texts. This situation led to the development of a number of tools that support experts in annotating texts. These tools range from simple word processors with auto-completion functionalities (e.g. Notepad++) over stand-alone tools with graphical user interfaces [18–20] to web-based solutions suitable for large distributed annotation projects [21–24]. Tools vary widely in their functionalities. For instance, some tools allow arbitrary annotation schemes [20] while others support only a fixed scheme [25, 26]; some accept multiple input and/or create multiple output formats [27] while others are bound to a single format [24]; some focus on linguistic text properties [19] while others excel in domain-specific mentions [21].

Despite the importance of such tools for text mining, there exist very few previous surveys on them. The study by Uren et al. [28] focuses on the Semantic Web and does not include most of the tools covered in this work. Dipper et al. [29] provide a survey of five XML-based annotation tools focusing on linguistic annotations. To the best of our knowledge, a survey specifically focusing on the needs of the biomedical domain does not exist yet.

One important distinction to make is that between annotation and curation. By annotation, we mean the complete tagging of a given text with respect to some intended facts [17], usually for the construction of a gold standard. Sometimes, also facts that might be irrelevant for the intended scope should be annotated. For instance, if a GSC is to be created for PPIs, mentions of proteins not taking part in an interaction should also be tagged, as this provides important negative information for training and evaluation. By curation, in turn, we mean the analysis of a given document with respect to some information being sought. For instance, curated PPI databases scan documents to extract information about protein interactions [30]. Curating a document does not always imply tagging at the text level but sometimes only at the document level. Furthermore, many PPIs mentioned in a text might be ignored because they do not obey the quality standards of the curators (e.g. hypothetical or negated PPIs). Curation can be supported by text mining [31] and other tools [32, 33], but the requirements for the latter are quite different. This survey focuses on annotating texts; however, in the 'Discussion' section, we will also briefly discuss how well the evaluated tools are suitable for curation.

Altogether, we considered almost 30 tools, 13 of which were selected for an in-depth analysis. We excluded tools which are not freely available or that focus only on linguistic annotations, and we only considered tools which had been applied at least once in the biomedical domain. The 13 tools meeting these constraints were analyzed in detail with respect to 35 predefined criteria, including issues of documentation and extensibility, supported formats and platforms, implemented functionality and popularity. Furthermore, we locally installed and tested most of the 13 tools to be able to report on the difficulty of installation and the general (naturally somewhat subjective) look-and-feel. We also discuss which tools are more suitable for specific situations, for example which of them support the annotation of full texts or of relationships in addition to entities.

MATERIALS AND METHODS
We screened PubMed and Google Scholar, checked corpora publications and used the Google search engine to create a list of tools potentially useful for annotating texts. All appropriate tools were evaluated according to the selection criteria presented in the 'Selection of tools' section. Tools matching these requirements were assessed in detail using the criteria described in the 'Criteria for selected tools' section.

Selection of tools
There is a variety of tools available for the annotation of textual documents. In this survey, we focus on tools for manual annotation of semantic facts in biomedical documents and, optionally, for supporting database curation [34]. Therefore, a tool should allow manual annotation of texts. Manual annotation may be at any level, e.g. document, snippet, sentence or mention. Tools should support annotation of named entities and, optionally, the annotation of relationships. We put no constraints regarding supported formats. Finally, tools should be freely available, which, for instance, excluded [25]. We focused on general-purpose tools for the manual annotation of biomedical documents, as testified by at least one report on a successful application of the tool for creating a gold standard or curating biomedical data. This, for instance, excluded tools with a sole linguistic purpose or which have been used for named-entity annotation in other domains, such as Ellogon [35], PalinkA [36], SALTO [37], Slate [38] or UAM Corpus [39]. We also excluded tools which have been used for linguistic annotation in biomedical documents (but not for biomedical facts). For instance, in [40], biomedical documents belonging to the Genia corpus [10] have been annotated with discourse relations using the PDTB tool. Regarding the tools which have been used for biomedical corpus annotation, we only excluded CADIXE (http://caderige.imag.fr/Cadixe/), used in [41, 42], which is available only on demand, because we did not receive an answer from the tool's developers. Tools which focus on annotations of Web documents, such as Domeo [43], have not been considered either. General-purpose curation tools which do not allow the manual annotation of texts, such as Textpresso [33] and PaperBrowser [44], are also not the focus of this survey. Finally, we did not further consider tools that were developed for a single specific question without apparent abilities to be adapted to other tasks, such as [32, 45], GOAnnotator [46] and PreBIND [47].

We tried to install all tools fulfilling these requirements. Tools which we could not install were still considered further, skipping the hands-on experiences; their download functionality, however, had to be working. We give our opinion regarding the ease of installation in the detailed evaluation of each tool in Section 2 of the Supplementary Data.

Criteria for selected tools
We identified 13 tools meeting our requirements. For those, we performed an in-depth evaluation based on published features and hands-on experiences. Tools were evaluated with respect to a list of 35 criteria grouped into the following categories (cf. Table 1): basics, publication, system properties, data and functionalities. Basics criteria assess the quality of the documentation and the existence of support (e.g. mailing lists and forums). Publication criteria encompass the year of the last publication, the number of citations (according to Google Scholar) and the number of papers describing applications of the tool to biomedical texts. System properties include availability of code, supported operating systems, general architecture (stand-alone, web-based, plug-in) and license. The data section addresses the formats supported by the tool for input documents, input of pre-annotations, output of the annotations and the annotation scheme. Finally, the functionalities group contains features to support the annotation process, like the smallest unit of annotation (character or token), built-in biomedical named-entity extraction, support for fast annotation (shortcut keys), pre-annotations, ontologies or inter-annotator agreement.

RESULTS
We screened the scientific literature for descriptions of tools that support the annotation of biomedical corpora. Tools were selected for an in-depth analysis if they are freely available, support a sufficiently general class of annotation tasks and have been previously applied to biomedical texts. Thirteen tools met these criteria: @Note, Argo, Bionotate, Brat, Callisto, Djangology, GATE Teamware, Knowtator, MMAX2, MyMiner, Semantator, WordFreak and XConc Suite. Wherever possible and feasible, we also installed the tools locally and experimented with their functionality and user
Table 1: Criteria used for the evaluation of annotation tools

Category | Code | Criteria
Basics | C1 | Worth looking at the tool
Basics | C2 | Quality of the documentation
Basics | C3 | Existence of support (forum, mailing list)
Publication | C4 | Year of the last publication
Publication | C5 | Number of references in Google Scholar
Publication | C6 | Number of references related to corpora development
System properties | C7 | Type of installation (web, stand-alone, plug-in)
System properties | C8 | Supported operating systems
System properties | C9 | Availability of source code
System properties | C10 | Dependence on external software
System properties | C11 | Date of the last version
System properties | C12 | Easiness of installation
System properties | C13 | License of the tool
Data | C14 | Input format for the documents
Data | C15 | Input format for the annotations
Data | C16 | Format of the annotation schema
Data | C17 | Output format for the annotations (in-line, stand-off)
Functionalities | C18 | Linguistic level of annotations (token, character)
Functionalities | C19 | Allowance of overlapping annotations
Functionalities | C20 | Possibility of using only the keyboard
Functionalities | C21 | Automatic selection of a token when clicked
Functionalities | C22 | Support for fast annotation
Functionalities | C23 | Allowance of comments on the annotation
Functionalities | C24 | Ability to annotate relationships
Functionalities | C25 | Support for ontologies and terminologies
Functionalities | C26 | Allowance of document collections
Functionalities | C27 | Support for full text
Functionalities | C28 | Allowance of saving documents partially
Functionalities | C29 | Ability to highlight or bookmark parts of the text
Functionalities | C30 | Pre-processing the text (sentence splitting, tokenization, part-of-speech tagging)
Functionalities | C31 | Built-in biomedical named-entity recognition
Functionalities | C32 | Easiness to import pre-annotations
Functionalities | C33 | Assignment of the documents to the annotators
Functionalities | C34 | Inter-annotator agreement
Functionalities | C35 | Queries to a document or collection
interface. Installation succeeded with one exception: GATE Teamware. In this section, we present each of these tools in detail and highlight their strengths and pitfalls. The main technical features of each tool are summarized in Table 2, while the functional ones are shown in Table 3. Tools are presented in alphabetical order. A more detailed description of each tool can be found in Section 2 of the Supplementary Data.
@Note, University of Minho (Portugal) and University of Vigo (Spain)
@Note [48] is a workbench supporting article retrieval, journal crawling, PDF-to-text conversion, pre-processing and annotation of documents. The tool has been used for annotating microbial cellular responses [49] and the stringent response of Escherichia coli [50]. It is not available as open source, although plug-ins may be added to it through the AIBench framework. It only allows the annotation of named entities, and a limited set of dictionaries is included to perform manual or automatic annotation. It is pre-configured for the annotation of 14 biological entity classes and can be extended to support arbitrary entity types. Document collections are constructed through queries to Medline, and full texts can be annotated only if the corresponding PDF is found on the Web.
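Dictionary-based annotation of the kind @Note performs can be sketched as follows; the dictionary entries and entity classes below are illustrative examples, not @Note's actual built-in dictionaries:

```python
# Minimal sketch of dictionary-based entity tagging, as done by tools such
# as @Note. The dictionary below is illustrative only.
DICTIONARY = {
    "E. coli": "Organism",
    "lacZ": "Gene",
    "beta-galactosidase": "Protein",
}

def tag(text, dictionary=DICTIONARY):
    """Return sorted (start, end, entity_class, surface) tuples for every
    occurrence of a dictionary entry in the text."""
    annotations = []
    for name, entity_class in dictionary.items():
        start = text.find(name)
        while start != -1:
            annotations.append((start, start + len(name), entity_class, name))
            start = text.find(name, start + 1)
    return sorted(annotations)

text = "Expression of lacZ in E. coli yields beta-galactosidase."
for annotation in tag(text):
    print(annotation)
```

A real tool would additionally handle overlapping matches, token boundaries and spelling variants, which is where most of the engineering effort lies.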
Argo, National Centre for Text Mining (NaCTeM), University of Manchester (UK) Argo [24] is a web-based curation tool that has been used for annotating interactions between marine drugs and enzymes [24]. It is similar to
Table 2: Comparison of all tools according to selected technical criteria

Tool | Year | Ref. | Corp. | Install. | Src. | Depend. | Scheme | Input | Output | License
@Note | 2009 | 14 | 2a | SA | – | Java, MySQL | – | TXT, PDF | XML IL | –
Argo | 2012 | 1 | 1a | Web | – | Web browser | UIMA | TXT | – | NaCTeM
Bionotate | 2009 | 12 | 2a | Web | ✓ | Apache | XML | TXT | XML SO | GNU GPL
Brat | 2012 | 4 | 3 (2a) | Web | ✓ | Apache, Python | TXT | TXT | TXT SO | CC BY 3.0
Callisto | 2004 | 16 | 3 | SA | ✓ | Java | DTD | TXT | XML SO | MITRE Co.
Djangology | 2010 | 1 | 1a | Web | ✓ | Django, Python | GUI | DB | DB | –
GATE | 2010 | 8 | 3 | SA, Web | ✓ | Java, Perl, MySQL | GUI | TXT | XML IL, DB | GNU AGPL
Knowtator | 2006 | 62 | 9 (1a) | Plug-in | ✓ | Protégé Frames | Ontology | TXT, XML | XML SO | Mozilla Public License 1.1
MMAX2 | 2006 | 87 | 2 | SA | ✓ | Java | XML | TXT, XML | XML SO | –
MyMiner | 2012 | 0 | 1a | Web | – | Web browser | GUI | TXT, PDF | TXT IL | –
Semantator | 2011 | 0 | 2a | Plug-in | – | Protégé OWL | OWL ontology | TXT | RDF, XML SO | –
WordFreak | 2003 | 35 | 7 | SA | ✓ | Java | Java classes | TXT, XML | XML SO | Mozilla Public License 1.1
XConc | – | – | 4a | Plug-in | ✓ | Eclipse | DTD | XML | XML IL | GNU LGPL

Abbreviations and corresponding criteria are as follows. 'Year': year when the tool was first published (cf. C4). 'Ref.': number of references found for the publication (as of August 2012) (cf. C5). 'Corp.': number of publications referring to applications in biomedical annotation (cf. C6); 'a' indicates that the publication is from the developers of the tool. 'Install.': type of installation, i.e. plug-in, stand-alone (SA) or web-based (cf. C7). 'Src.': source code availability (cf. C9). 'Depend.': dependencies on external software (cf. C10). 'Scheme': format of the annotation schema (cf. C16). 'Input': format for the input text (cf. C14). 'Output': export format for the annotations, i.e. file, database (DB), in-line (IL) or stand-off (SO) (cf. C17). 'License': name of the license under which the tool is available (cf. C13). '✓' indicates yes; '–' indicates no or not available.
U-Compare [51] in that users can build their own annotation workflow using UIMA components. Argo is curation-oriented but allows manual annotation of text, although full texts might not be visualized properly. The tool only supports annotation of named entities, and the annotation schema is limited to the entity types included in the UIMA type systems. Therefore, its use might not be intuitive for those not familiar with UIMA. Analysis engines are available to pre-process documents, and automatic extraction is performed using, for instance, ABNER, the GENIA tagger and OscarMER.
Bionotate, University of Granada (Spain) and Harvard Medical School (USA)
Bionotate [52] is a web-based tool aimed at annotating snippets in a collaborative way. The authors report on a pilot case study for the annotation of PPIs and gene–disease relationships [52], and Bionotate has also been used as part of the Genome-Environment-Trait Evidence (GET-Evidence) system [53]. It is an open source tool and features a very simple installation. The annotation scheme is composed of predefined entity types and so-called questions encoding relationships. The existence of a relationship is expressed by the answer to the question, which is limited to a single one per snippet.
Therefore, Bionotate might not be suitable for the annotation of an unlimited number of entities or relationships in a text. The latest version also allows integration of named entity recognition services [54] and assignment of ontology terms to curated entities [55] (not yet available).
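The snippet-plus-question scheme can be illustrated with a minimal sketch; the field names and the example task are our own illustration, not Bionotate's actual file format:

```python
# Illustrative sketch of a Bionotate-style annotation task: a short text
# snippet, predefined entity types and a single question encoding the
# relationship. All field names here are hypothetical.
task = {
    "snippet": "RAD51 interacts with BRCA2 during recombination.",
    "entity_types": ["protein"],
    "question": "Do the two marked proteins interact?",
    "allowed_answers": ["yes", "no", "unsure"],
}

def answer_task(task, annotator, answer):
    """Record one annotator's answer; each snippet carries exactly one question."""
    if answer not in task["allowed_answers"]:
        raise ValueError(f"invalid answer: {answer!r}")
    return {"snippet": task["snippet"], "annotator": annotator, "answer": answer}

print(answer_task(task, "annotator-1", "yes")["answer"])
```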
Brat, University of Tokyo (Japan) and University of Manchester (UK)
Brat [21] is a web-based tool for text annotation which has been used for a variety of projects, including cancer biology, epigenetics and post-translational modifications (EPI) and the BioNLP 2011 Shared Tasks [21]. It has also been employed for annotating stem cell-related documents [56] and anatomical terms [57]. Documents are displayed graphically by sentences in a very appealing manner; interactive annotation of entities and relationships is intuitive and well supported by appropriate shortcut keys. Pre-annotations can be loaded easily using a plain-text stand-off format, and tools accessible as web services can be integrated for automatic text pre-processing and annotation. On the downside, Brat has difficulties with the annotation of full texts. To work with the tool, we had to split all documents into parts composed of at most 50 sentences [56].
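Stand-off representations of the kind Brat uses keep the source text untouched and store annotations as offset-based records, in contrast to in-line markup, which embeds tags in the text itself. The sketch below contrasts the two for a single sentence; the stand-off lines are modeled on Brat's plain-text entity format, but should be treated as an approximation:

```python
# The same entity mentions represented in-line (markup embedded in the text)
# and stand-off (annotations refer to character offsets in the untouched text).
text = "p53 binds MDM2."

# In-line: the text itself is modified, which complicates offset arithmetic.
inline = ('<entity type="Protein">p53</entity> binds '
          '<entity type="Protein">MDM2</entity>.')

# Stand-off: the text stays untouched; annotations live in a separate file.
# Each line: id <TAB> type start end <TAB> surface form (Brat-like layout).
standoff = [
    "T1\tProtein 0 3\tp53",
    "T2\tProtein 10 14\tMDM2",
]

def check_offsets(text, standoff_lines):
    """Verify that each stand-off annotation matches the text it points to."""
    for line in standoff_lines:
        _id, type_and_span, surface = line.split("\t")
        _type, start, end = type_and_span.split(" ")
        assert text[int(start):int(end)] == surface, line
    return True

print(check_offsets(text, standoff))
```

The offset check above is the invariant that makes stand-off formats easy to validate and merge, which is one reason they dominate in corpus construction.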
Table 3: Comparison of all tools according to selected functional criteria

Tool | Speed | Full | Relat. | Pre-pros. | NER | Pre-ann. | Ontol. | IAA
@Note | Fine | Poor | – | Sentences, tokens | Dictionary-based (14 types) | – | – | –
Argo | Fine | Poor | – | Sentences, tokens, stopwords, POS tags | ABNER, GENIA, OscarMER, species | Poor (XML) | – | –
Bionotate | Good | Poor | Poor (questions) | Sentences | – | – | – | Poor (no. equal answers)
Brat | Very good | Poor | Very good (connectors) | – | Web-services | Good (TXT) | – | –
Callisto | Fine | Good | Fine (attributes) | – | – | Poor (XML) | – | –
Djangology | Good | Good | – | – | – | Good (database) | – | Good (P/R/F, side-by-side comparison)
GATE | Good | Good | Fine (attributes) | GATE components | GATE components | Good (XML) | GATE ontology editor | Good (P/R/F, kappa, comparison, consensus)
Knowtator | Fine | Good | Fine (slot filling) | Sentences | – | Poor (XML) | Protégé | Good (metrics and consensus)
MMAX2 | Fine | Fine | Fine (connectors) | Tokenization | – | Good (XML) | – | Good (kappa, comparison)
MyMiner | Good | Poor | Poor (matrix check-box) | – | Dictionary-based, ABNER, Linnaeus | – | OMIM, NCBI taxonomy, UniProt | Fine (time and differences in annotations)
Semantator | Good | Good | – | – | BioPortal, cTAKES | – | OWL | –
WordFreak | Fine | Fine | Fine (slot filling) | Sentences, POS tags, parsing | – | Fine (XML) | – | –
XConc | Fine | Poor | Poor (connectors) | – | – | – | OWL | –

Abbreviations and corresponding criteria are as follows. 'Speed': speed of performing annotations; it considers the criteria fast annotation (cf. C22), built-in biomedical NER (cf. C31) and pre-annotations (cf. C32). 'Full': support for full text (cf. C27). 'Relat.': possibility of annotating relationships (cf. C24). 'Pre-pros.': integrated pre-processing of the text (e.g. sentence splitting and tokenization) (cf. C30). 'NER': built-in automatic extraction of biomedical named entities (cf. C31). 'Pre-ann.': easiness to import pre-annotations (cf. C32). 'Ontol.': support of ontologies (cf. C25). 'IAA': support for inter-annotator agreement, i.e. precision (P), recall (R), f-score (F) (cf. C34). '–' indicates no support.
Callisto, MITRE Corporation (USA)
Callisto [58] is a stand-alone Java text annotation framework, and customized versions of it have been used for annotating a variety of corpora, including ITI TXM [59], temporal clinical data [60] and also clinical data in [61]. The annotation scheme is configured by developing a custom plug-in based on a DTD file; therefore, users should be familiar with this format. We failed to annotate relationships properly, probably due to errors in our DTD, and the system lacks examples of DTDs supporting relationships. Callisto is rather popular in the biomedical domain and supports the annotation of both named entities and relationships.

Djangology, DePaul University (USA) and Northwestern University (USA)
Djangology [22] is a web application for collaborative and distributed annotation of documents. It was originally developed for the annotation of named entities in medical studies of trauma, shock and sepsis conditions. The tool is based on the Django web framework, and annotations are saved in a database (PostgreSQL, MySQL, Oracle or SQLite are supported). Annotations and documents can be easily loaded into the system, also through a direct connection to the database. Annotation of relationships is not supported. It allows annotation by multiple annotators, who are also easily managed through the web interface. Djangology is capable of computing inter-annotator agreement and features a helpful side-by-side comparison of documents annotated by distinct annotators.

GATE Teamware, University of Sheffield (UK)
GATE Teamware is a web-based collaborative annotation and curation environment [23] based on the GATE infrastructure [18]. The system has been used for annotating documents related to fungal enzymes [62], lignocellulose research [63] and for the OT (OrganismTagger) corpus [64]. It allows GATE-based routines for pre-processing and pre-annotation. Annotation schemes are configured using ontologies (relationships are modeled as properties in the ontology). The tool supports multiple projects with multiple members through the flexible definition and assignment of roles to users. It also offers monitoring of progress and generation of statistics in real time. The complexity of the tool probably renders it unsuitable for small projects.

Knowtator, Mayo Clinic (USA)
Knowtator [20] is a general-purpose annotation tool which runs in Protégé Frames (http://protege.stanford.edu/overview/protege-frames.html). It is certainly one of the most popular tools and was used by a number of biomedical annotation projects, such as the CRAFT corpus [65], annotation of electronic patient records in Swedish [66], clinical entities and relationships in the CLEF Corpus [67], semantic analysis of PubMed queries [68, 69], concept analysis [70], radiology reports [71], temporal expressions in medical narratives [72] and drug–drug interactions [73]. Annotation schemes are defined as ontologies and may include hierarchical relationships. Relationships are defined using slot filling, including co-references and relationships constrained to certain entity types. In our hands-on experiments, importing pre-annotations did not work satisfactorily, especially if some manual annotations had already been included in the document. Knowtator includes a consensus module to perform inter-annotator agreement and to create a consensus gold standard corpus.

MMAX2, EML Research (Germany)
MMAX2 [19] is a general-purpose annotation tool supporting creation, browsing, visualization and querying of annotations. It has been integrated into the JANE active learning framework [74] and, as such, was used for annotating corpora related to gene regulation [14] and hematopoietic stem cell transplantation [75]. Input documents, annotation schemes and stand-off annotations are represented in XML format, and the tool allows annotation of entities and relationships, as well as computation of inter-annotator agreement. We found usage of the tool rather unintuitive and had difficulties configuring and using MMAX2 for the annotation of relationships.

MyMiner, Monash University (Australia), Spanish National Cancer Research Center (Spain), Université de la Méditerranée (France), Indian Institute of Technology (India)
MyMiner [76] is a web-based tool supporting a range of tasks in text curation, such as document labeling, annotation of entities and binary relationships and entity normalization. It participated in the BioCreative III IAT [77] and has been used for the curation of PPIs [78] and for assisting manual document classification in the BioCreative III ACT [79]. Pre-annotation can be performed using ABNER and Linnaeus, and the
system can be configured for manual annotations of other entity types. Relationships are annotated by means of a matrix with the distinct mentions found in text. While it is suitable for data curation, it is problematic for annotation, as sometimes two entities do interact in some part of a document but not in another. Additionally, a simple inter-annotator agreement functionality is provided for showing differences between annotations.
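Entity-level inter-annotator agreement of the kind reported by these tools (cf. criterion C34) is commonly computed by treating one annotator as the reference and scoring the other against it. The following is a generic sketch, not the implementation of MyMiner or any other tool discussed here:

```python
# Sketch of entity-level inter-annotator agreement as precision/recall/F-score,
# treating annotator A as the reference. Annotations are (start, end, type)
# tuples; only exact span-and-type matches count as agreement here.
def agreement(reference, candidate):
    ref, cand = set(reference), set(candidate)
    tp = len(ref & cand)                       # annotations both agree on
    precision = tp / len(cand) if cand else 0.0
    recall = tp / len(ref) if ref else 0.0
    f = (2 * precision * recall / (precision + recall)) if tp else 0.0
    return precision, recall, f

annotator_a = [(0, 3, "Protein"), (10, 14, "Protein"), (20, 25, "Gene")]
annotator_b = [(0, 3, "Protein"), (10, 14, "Gene")]

p, r, f = agreement(annotator_a, annotator_b)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # → P=0.50 R=0.33 F=0.40
```

Chance-corrected measures such as Cohen's kappa, which some tools also report, additionally require an estimate of expected agreement by chance.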
Semantator, Mayo Clinic (USA) and Lehigh University (USA)
WordFreak, University of Pennsylvania (USA) WordFreak [27] is a Java-based open source annotation tool built around a highly modular plug-in architecture which can be used to integrate tools for pre-annotation. It is one of the most popular tools and has been used in several projects, such as semantic role labeling [82], for the GREC corpus [83, 84], for SNPs in [85], for histone modifications in [86] and for semantic frames [87]. Annotation schemes have to be specified as plug-ins, which means that it might require a certain amount of Java programming. It is not clear whether annotation of complex relationships, such as biological events, can be handled by the tool. We installed and tested WordFreak successfully, but refrained from testing its abilities for adaptation.
Semantator [80] is a Protégé OWL (http://protege.stanford.edu/overview/protege-owl.html) plug-in which allows manual and semi-automatic annotation of texts and was used for the annotation of temporal data in clinical narratives [81]. It is similar to Knowtator (cf. above) in that the annotation schema is defined by a hierarchical ontology, but Semantator has the advantage of supporting OWL ontologies. Although the documentation claims support for the annotation of co-references and of relationships, we were not able to use this functionality during our hands-on experiments. For pre-annotation, users can choose among tools from the BioPortal and the Mayo Clinic's Clinical Text Analysis and Knowledge Extraction System.

XConc Suite, University of Tokyo (Japan) and University of Manchester (UK)
XConc Suite [10] is a general-purpose annotation tool provided as an Eclipse plug-in. It has been used for annotating terms in the GENIA corpus [15], meta-knowledge also in the GENIA corpus [88], for pathway curation [89] and for the HANAPIN corpus [90]. The tool allows importing ontologies in OWL format and is equipped with an ontology viewer. It supports the annotation of named entities as well as of complex relationships. However, we found the annotation of complex events rather cumbersome, as events are shown below a sentence instead of being displayed in the text itself, which might also limit the tool's suitability for annotating full texts.

DISCUSSION
We compared 13 tools for annotating biomedical texts with respect to a set of 35 criteria. Our evaluation showed that no single tool supports all use cases with equal robustness. Instead, all tools have their pros and cons, making them more or less suitable depending on the task at hand. In the following, we highlight the strengths of different tools by discussing their suitability for common use cases in biomedical annotation. An overview of the features supported by each tool is shown in Figure 1.

Annotation versus curation
Most of the tools described in this survey were developed for annotation purposes. Exceptions are @Note, Argo and Bionotate, which were originally intended to support curation. They therefore contain functionality that typically is not in focus during annotation, such as document classification (cf. [34] for examples of biocuration workflows). Of the other tools, only MyMiner provides a component for document categorization, which can be helpful for document triage [34]; otherwise, the selection of texts is left to the users. To enable annotation, all tools present large portions of the texts to be curated and let the user select and then annotate tokens. An exception is Bionotate, which only shows text snippets (cf. criterion C27). This makes the annotation of longer texts somewhat painful, since they need to be broken into smaller pieces, which in turn hinders a quick consideration of the context of a fact.

Supported annotations
Three of the 13 tools only support the annotation of named entities, namely @Note, Argo and Djangology. This is probably a severe limitation for many projects, as the annotation of relationships has become a de facto standard in biomedicine [13, 14, 91]. How the annotation of relationships is supported varies considerably between the tools (cf. criterion C24). Bionotate poses questions to the user, while MyMiner generates a matrix of possible relationships, which implies that annotation at the mention level is not possible. Callisto, GATE Teamware, Knowtator, Semantator and WordFreak use slot filling, which we found very hard to use in large-scale annotation as no visualization inside the text is provided. Only Brat, MMAX2 and XConc Suite show visual links between the entities directly in the text, although this feature is not straightforward to use in MMAX2 and its visualization in XConc is confusing.
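The tools serialize such entity and relationship annotations in quite different ways. As a concrete illustration, the sketch below parses a Brat-style standoff file, in which text-bound 'T' lines carry an entity type plus character offsets and 'R' lines link two entity identifiers; the identifiers, entity types and offsets in the example are invented.

```python
# Minimal sketch of reading Brat-style standoff annotations.
# "T" lines are text-bound entities (ID, "Type start end", surface string);
# "R" lines are binary relations between entity IDs. All example values
# (T1, Protein, Interaction, offsets) are invented for illustration.
def parse_standoff(ann: str):
    entities, relations = {}, []
    for line in ann.strip().splitlines():
        fields = line.split("\t")
        if fields[0].startswith("T"):
            etype, start, end = fields[1].split()
            entities[fields[0]] = (etype, int(start), int(end), fields[2])
        elif fields[0].startswith("R"):
            rtype, arg1, arg2 = fields[1].split()
            relations.append((rtype, arg1.split(":")[1], arg2.split(":")[1]))
    return entities, relations

ann = "T1\tProtein 0 5\tBRCA1\nT2\tProtein 23 27\tTP53\nR1\tInteraction Arg1:T1 Arg2:T2"
entities, relations = parse_standoff(ann)
```

Because the relation lines reference entity identifiers rather than text positions, a viewer can draw the links directly over the annotated spans, which is essentially what Brat's in-text visualization does.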
Figure 1: Overview of the functionalities provided by each tool. The criteria shown, together with their corresponding codes, are: 'Relationships' (cf. C24), 'Inter-Annotator Agreement' (cf. C34), 'Full Text' (cf. C27), 'Web-based' (cf. C7), 'Text mining pipeline' (cf. C32), 'Ontologies' (cf. C25) and 'Built-in Bio-NER' (cf. C31).

Pre-annotation
Clearly, manual annotation can benefit substantially from pre-annotation. If a comprehensive and high-quality pre-annotation is available, the task of the annotator changes from careful reading of texts to confirmation or rejection of system-suggested annotations. However, one has to keep in mind that such an approach is only useful if pre-annotations are of extremely high quality, which is still wishful thinking for all but a few entity types and well beyond the state of the art for relationships. @Note, Argo, MyMiner and Semantator come with a named-entity module for the recognition of certain biological entity classes (cf. criterion C31). The GATE framework also contains a set of NER components which can be integrated into GATE Teamware. Finally, Brat allows the integration of the output of automatic text annotation and pre-processing tools through web services. Some of the other tools, i.e. Bionotate, Brat, Knowtator, WordFreak and XConc Suite, allow importing pre-annotated texts, but leave it to the users to compute such pre-annotations with tools of their choice (cf. criteria C14, C17, C32). This can be seen as an advantage for the experienced user who wants to use her/his favorite tools, or as a disadvantage for the less experienced user who is not able to perform independent pre-annotation. In our tests, we could add pre-annotations without problems for Bionotate, Brat and Djangology, and it might likewise be feasible using the XML format supported by XConc Suite. In contrast, manipulating the input file failed in the case of Callisto, although others have reported its use for the validation of pre-annotations [60]. Knowtator provides an import functionality but warns users of some risks when using it; indeed, we experienced severe problems when trying it.
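For users who must compute pre-annotations with tools of their own choice, even a simple dictionary matcher can serve as a starting point. The sketch below is purely illustrative and not taken from any of the surveyed tools; the dictionary entries and entity types are invented.

```python
import re

# Hypothetical dictionary-based pre-annotator: finds exact matches of known
# terms and returns character spans that could then be converted into the
# import format of an annotation tool. Dictionary contents are invented.
DICTIONARY = {"BRCA1": "Gene", "E. coli": "Species"}

def pre_annotate(text: str):
    """Return sorted (start, end, type, surface) spans for dictionary hits."""
    spans = []
    for term, etype in DICTIONARY.items():
        for m in re.finditer(re.escape(term), text):
            spans.append((m.start(), m.end(), etype, m.group()))
    return sorted(spans)

spans = pre_annotate("Mutations in BRCA1 were studied; E. coli served as control.")
```

Real pre-annotation pipelines would add tokenization, case handling and disambiguation, but even such crude suggestions only help if their quality is high enough that correcting them is cheaper than annotating from scratch.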
Simple versus comprehensive
Biomedical annotation may be carried out by a range of different users. Some users are proficient in computer usage and experienced in working with natural language texts; they often seek a tool with rich functionality and usually accept the need to invest a certain amount of work in adapting or configuring the tool to their specific needs. Other users are less experienced and prefer tools which are simple to use and provide all functionality out of the box. Table 3 and Figure 1 indicate that GATE, Knowtator and Semantator are the most comprehensive in terms of functionality. They support all standard formats (cf. criteria C14, C15, C16 and C17), are able to integrate ontologies to define the annotation vocabulary (cf. criterion C25), allow annotation of entities and relationships (cf. criterion C24) and also adequately support projects with many annotators (cf. criteria C33 and C34). Having comprehensive annotation guidelines is also important; however, supporting their creation is beyond the scope of typical annotation tools. Regarding simplicity, we recommend Brat, Bionotate, Djangology, Knowtator and MyMiner. These tools are simple to install (cf. criterion C12), can be configured easily (cf. criterion C1) and provide good documentation (cf. criterion C2) or self-explanatory examples. Still, defining a proper annotation schema was not simple with any of the tools.

Popularity and support
One way of assessing tools is their recent popularity. This may be estimated from citation counts or from concrete reports on recent usage for the annotation of biomedical corpora. MMAX2 (87), Knowtator (62), WordFreak (35) and Callisto (16) lead in terms of citations (counts according to Google Scholar, as of August 2012) (cf. criterion C5). When taking into account their usage in the biomedical domain, we found nine reports for Knowtator, seven for WordFreak, four for XConc, three for Callisto, Brat and GATE and only two for MMAX2 (cf. criterion C6). However, tools which have been published only recently, like Brat and MyMiner, certainly are at a disadvantage in both these criteria. Another important factor is the quality of documentation and the liveliness of a tool, reflected in ongoing development and active user support through mailing lists or forums (cf. criteria C2 and C3). Some of the tools, especially Callisto, Knowtator and WordFreak, seem not to be maintained anymore and their last versions date back more than a couple of years. Callisto's website even became unavailable recently, and questions posted to the forums of Knowtator, MMAX2 and WordFreak have not been answered by the developers within the last months. In contrast, Brat, GATE, Knowtator, MyMiner, Semantator and XConc provide very good documentation, and Brat and GATE in particular have an active developer base.

Support for full texts
The use of full texts is a crucial issue in biomedical research, as studies have shown that the differences between results obtained only from abstracts and those obtained from full texts are significant [92, 93]. All stand-alone tools, as well as Djangology and WordFreak, support the visualization and annotation of full texts, although the text sometimes loses its formatting when imported into XML or plain-text formats (cf. criterion C27). We found XConc Suite to be less suitable for full texts as it displays relationships below each sentence, which is confusing when working with long documents. For Bionotate and Brat, texts must be broken into smaller pieces to be handled adequately; in our experience with Brat for the CellFinder corpus [56], we had to split all documents into parts of at most 50 sentences. Full texts are also not correctly displayed in Argo and MyMiner, whose text areas seem to be limited to a fixed size. Finally, @Note and MyMiner claim to support PDF documents, a functionality which has not been assessed in this survey (cf. criterion C14).

Software architecture and availability of code
An important difference between tools lies in their architecture and the platforms they support. All tools we considered are generally platform independent (cf. criterion C8). However, as some run as plug-ins or extensions of other, often highly complex systems (GATE, Protégé or Eclipse), installation may be involved and require the pre-installation of other packages (cf. criterion C10). About half of the tools run in a web environment, namely Argo, Bionotate, Brat, Djangology, GATE Teamware and MyMiner (cf. criterion C7). Argo and MyMiner can be used without any installation, which implies that annotated texts are in the first place hosted on an external server. This might be considered a problem when confidentiality or licensing is an issue. In our evaluation, we also put a special focus on extensibility, such as the ability of a tool to be used for annotation other than the pre-configured facts (cf. criterion C16). In some cases, however, users might want to perform more radical changes, for instance to change the user interface or to support other visualization paradigms. In such cases, availability of the source code is a prerequisite (cf. criterion C9). Source code is currently not available for @Note, Argo, MyMiner and Semantator.

CONCLUSIONS
In this work, we surveyed 13 tools suitable for manual annotation of biomedical texts. These tools exhibit a large and diverse set of features and differ substantially in many aspects, such as the basic architecture (web-based or stand-alone), the types of facts that can be annotated (only fixed entities, configurable entities, relationships), their support for larger annotation projects (inter-annotator agreement), the maturity of the software, the support from the developers and the handling of pre-annotations. We presented a brief description of each tool, evaluated them according to a comprehensive set of pre-defined criteria, reported on hands-on experiences for most of them and discussed them with respect to common situations in biomedical annotation. Our study shows that there is no perfect tool which could be recommended in every situation. Nonetheless, we believe that as long as the requirements of a project are not too special, researchers aiming to annotate a biomedical corpus should be able to find a tool which complies with their needs. However, non-standard requirements in all cases require adaptation at the code level. Overall, we hope that this survey raises awareness of what is already available and helps to diminish the resources going into the costly development of functionality already available for free.

Key Points
- There exists a multitude of tools for annotating biomedical corpora, an important resource in text mining.
- None of these tools satisfies all wishes and needs, but comprehensive and easy-to-use solutions exist for many use cases.
- Most of the tools are released as open source, which may compensate for deficiencies in functionality and facilitates integration with other systems or frameworks.
- Only a few tools allow the use of ontologies, especially in standard formats such as OBO and OWL.
- The number of tools offered as web applications is increasing; these completely relieve users from software installation and maintenance, but store annotations on remote servers.

SUPPLEMENTARY DATA
Supplementary data are available online at http://bib.oxfordjournals.org/.

Acknowledgements
We are thankful for the valuable suggestions, comments and corrections provided by Philippe Thomas and Michael Weidlich.

FUNDING
This work was supported by the Deutsche Forschungsgemeinschaft (DFG) [LE 1428/3-1].

References
1. Krallinger M, Morgan A, Smith L, et al. Evaluation of text-mining systems for biology: overview of the second BioCreative community challenge. Genome Biol 2008;9:S1.
2. Kim J-D, Nguyen N, Wang Y, et al. The GENIA event and protein coreference tasks of the BioNLP Shared Task 2011. BMC Bioinformatics 2012;13:S1.
3. Leser U, Hakenberg J. What makes a gene name? Named entity recognition in the biomedical literature. Brief Bioinform 2005;6:357–69.
4. Ananiadou S, Pyysalo S, Tsujii J, et al. Event extraction for systems biology by text mining the literature. Trends Biotechnol 2010;28:381–90.
5. Zhou D, He Y. Extracting interactions between proteins from the literature. J Biomed Inform 2008;41:393–407.
6. Sutton C, McCallum A. Introduction to Conditional Random Fields for Relational Learning. Cambridge, MA, USA: MIT Press, 2006.
7. Leaman R, Gonzalez G. BANNER: an executable survey of advances in biomedical named entity recognition. In: Proceedings of the Pacific Symposium on Biocomputing, 2008. p. 652–63.
8. Tikk D, Thomas P, Palaga P, et al. A comprehensive benchmark of kernel methods to extract protein–protein interactions from literature. PLoS Comput Biol 2010;6:e1000837.
9. Björne J, Ginter F, Pyysalo S, et al. Complex event extraction at PubMed scale. Bioinformatics 2010;26:i382–90.
10. Kim J-D, Ohta T, Tateisi Y, et al. GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics 2003;19:i180–2.
11. Gerner M, Nenadic G, Bergman C. LINNAEUS: a species name identification system for biomedical literature. BMC Bioinformatics 2010;11:85.
12. Klinger R, Kolářik C, Fluck J, et al. Detection of IUPAC and IUPAC-like chemical names. Bioinformatics 2008;24:i268–76.
13. Pyysalo S, Airola A, Heimonen J, et al. Comparative analysis of five protein–protein interaction corpora. BMC Bioinformatics 2008;9:S6.
14. Buyko E, Beisswanger E, Hahn U. The GeneReg corpus for gene expression regulation events: an overview of the corpus and its in-domain and out-of-domain interoperability. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta, 2010. European Language Resources Association (ELRA).
15. Kim J-D, Ohta T, Tsujii J. Corpus annotation for mining biomedical events from literature. BMC Bioinformatics 2008;9:10.
16. Segura-Bedmar I, Martínez P, de Pablo-Sánchez C. Extracting drug–drug interactions from biomedical texts. BMC Bioinformatics 2010;11:P9.
17. Cohen KB, Ogren PV. Corpus design for biomedical natural language processing. In: Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases, 2005. p. 38–45.
18. Cunningham H, Maynard D, Bontcheva K, et al. Text Processing with GATE (Version 6). 2011.
19. Müller C, Strube M. Multi-level annotation of linguistic data with MMAX2. In: Braun S, Kohn K, Mukherjee J (eds). Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods. Frankfurt a.M., Germany: Peter Lang, 2006, 197–214.
20. Ogren PV. Knowtator: a Protégé plug-in for annotated corpus construction. In: Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, 2006. p. 273–5. Morristown, NJ, USA: Association for Computational Linguistics.
21. Stenetorp P, Pyysalo S, Topić G, et al. Brat: a web-based tool for NLP-assisted text annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, 2012. p. 102–7. Avignon: Association for Computational Linguistics.
22. Apostolova E, Neilan S, An G, et al. Djangology: a lightweight web-based tool for distributed collaborative text annotation. In: Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, 2010. European Language Resources Association (ELRA).
23. Bontcheva K, Cunningham H, Roberts I, et al. Web-based collaborative corpus annotation: requirements and a framework implementation. In: Proceedings of the 'New Challenges for NLP Frameworks' Workshop at the Language Resources and Evaluation Conference (LREC), 2010.
24. Rak R, Rowley A, Black W, et al. Argo: an integrative, interactive, text mining-based workbench supporting curation. Database 2012;2012:bas010.
25. Rinaldi F, Clematide S, Schneider G, et al. ODIN: an advanced interface for the curation of biomedical literature. In: BioCuration 2010, Tokyo, Japan.
26. Liakata M, Claire Q, Soldatova LN. Semantic annotation of papers: interface & enrichment tool (SAPIENT). In: Proceedings of the BioNLP 2009 Workshop, Boulder, Colorado, 2009. p. 193–200.
27. Morton T, LaCivita J. WordFreak: an open tool for linguistic annotation. In: NAACL-Demonstrations '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: Demonstrations, Vol. 4, 2003.
28. Uren V, Cimiano P, Iria J, et al. Semantic annotation for knowledge management: requirements and a survey of the state of the art. J Web Semantics 2006;4:14–28.
29. Dipper S, Götze M, Stede M. Simple annotation tools for complex annotation tasks: an evaluation. In: Proceedings of the LREC Workshop on XML-based Richly Annotated Corpora (LREC'04), Lisbon, Portugal, 2004. European Language Resources Association (ELRA).
30. Hermjakob H, Montecchi-Palazzi L, Lewington C, et al. IntAct: an open source molecular interaction database. Nucleic Acids Res 2004;32:D452–5.
31. Winnenburg R, Wächter T, Plake C, et al. Facts from text: can text mining help to scale-up high-quality manual curation of gene products with ontologies? Brief Bioinform 2008;9:466–78.
32. Wiegers T, Davis A, Cohen KB, et al. Text mining and manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (CTD). BMC Bioinformatics 2009;10:326.
33. Müller H-M, Kenny EE, Sternberg PW. Textpresso: an ontology-based information retrieval and extraction system for biological literature. PLoS Biol 2004;2:e309.
34. Hirschman L, Burns GAPC, Krallinger M, et al. Text mining for the biocuration workflow. Database 2012;2012:bas020.
35. Petasis G, Karkaletsis V, Paliouras G, et al. Using the Ellogon natural language engineering infrastructure. In: Proceedings of the Workshop on Balkan Language Resources and Tools, 1st Balkan Conference in Informatics (BCI 2003), Thessaloniki, Greece, 2003.
36. Orăsan C. PALinkA: a highly customisable tool for discourse annotation. In: Proceedings of the 4th SIGdial Workshop on Discourse and Dialogue, ACL'03, 2003. p. 39–43.
37. Burchardt A, Erk K, Frank A, et al. SALTO: a versatile multi-level annotation tool. In: Proceedings of LREC-2006, Genoa, Italy, 2006.
38. Kaplan D, Iida R, Tokunaga T. SLAT 2.0: corpus construction and annotation process management. In: Proceedings of the 16th Annual Meeting of the Association for Natural Language Processing (NLP 2010), 2010. p. 510–3.
39. O'Donnell M. Demonstration of the UAM CorpusTool for text and image annotation. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Demo Session, HLT-Demonstrations '08, 2008. p. 13–6. Association for Computational Linguistics, Stroudsburg, PA, USA.
40. Prasad R, McRoy S, Frid N, et al. The Biomedical Discourse Relation Bank. BMC Bioinformatics 2011;12:188.
41. Bossy R, Jourde J, Manine A-P, et al. BioNLP Shared Task—the bacteria track. BMC Bioinformatics 2012;13:S3.
42. Nédellec C. Learning Language in Logic–genic interaction extraction challenge. In: Proceedings of the Learning Language in Logic 2005 Workshop at the International Conference on Machine Learning, 2005.
43. Ciccarese P, Ocana M, Clark T. Open semantic annotation of scientific publications using Domeo. J Biomed Semantics 2012;3:S1.
44. Karamanis N, Lewin I, Seal R, et al. Integrating natural language processing with FlyBase curation. In: Proceedings of PSB 2007, 2007. p. 245–56.
45. Karamanis N, Seal R, Lewin I, et al. Natural language processing in aid of FlyBase curators. BMC Bioinformatics 2008;9:193.
46. Couto F, Silva M, Lee V, et al. GOAnnotator: linking protein GO annotations to evidence text. J Biomed Discov Collab 2006;1:19.
47. Donaldson I, Martin J, de Bruijn B, et al. PreBIND and Textomy—mining the biomedical literature for protein–protein interactions using a support vector machine. BMC Bioinformatics 2003;4:11.
48. Lourenço A, Carreira R, Carneiro S, et al. @Note: a workbench for biomedical text mining. J Biomed Inform 2009;42:710–20.
49. Carreira R, Carneiro S, Pereira R, et al. Semantic annotation of biological concepts interplaying microbial cellular responses. BMC Bioinformatics 2011;12:460.
50. Carneiro S, Lourenço A, Ferreira E, et al. Stringent response of Escherichia coli: revisiting the bibliome using literature mining. Microb Inform Exp 2011;1:14.
51. Kano Y, Baumgartner WA, McCrohon L, et al. U-Compare: share and compare text mining tools with UIMA. Bioinformatics 2009;25:1997–8.
52. Cano C, Monaghan T, Blanco A, et al. Collaborative text-annotation resource for disease-centered relation extraction from biomedical text. J Biomed Inform 2009;42:967–77.
53. Ball MP, Thakuria JV, Zaranek AW, et al. A public resource facilitating clinical use of genomes. Proc Natl Acad Sci USA 2012;109:11920–7.
54. Cano C, Labarga A, Blanco A, et al. Collaborative semi-automatic annotation of the biomedical literature. In: 11th International Conference on Intelligent Systems Design and Applications (ISDA 2011), Córdoba, Spain, 2011.
55. Cano C, Labarga A, Blanco A, et al. Social and semantic web technologies for the text-to-knowledge translation process in biomedicine. In: Biomedical Engineering, Trends, Research and Technologies. InTech Open, 2011.
56. Neves M, Damaschun A, Kurtz A, et al. Annotating and evaluating text for stem cell research. In: Proceedings of the Third Workshop on Building and Evaluation Resources for Biomedical Text Mining (BioTxtM 2012) at Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012. p. 16–23.
57. Ohta T, Pyysalo S, Tsujii J, et al. Open-domain anatomical entity mention detection. In: Proceedings of the ACL 2012 Workshop on Detecting Structure in Scholarly Discourse (DSSD), 2012. p. 27–36.
58. Day D, McHenry C, Kozierok R, et al. Callisto: a configurable annotation workbench. In: Proceedings of the Language Resources and Evaluation Conference (LREC), 2004.
59. Alex B, Grover C, Haddow B, et al. The ITI TXM corpora: tissue expressions and protein–protein interactions. In: Proceedings of the Workshop on Building and Evaluating Resources for Biomedical Text Mining, LREC 2008, Marrakesh, Morocco, 2008.
60. Harkema H, Setzer A, Gaizauskas R, et al. Mining and modelling temporal clinical data. In: Cox S (ed). Proceedings of the 4th UK e-Science All Hands Meeting, Nottingham, UK, 2005.
61. Zhang Y, Patrick J. Extracting semantics in a clinical scenario. In: Roddick JF, Warren JR (eds). Australasian Workshop on Health Knowledge Management and Discovery (HKMD 2007), 2007. Vol. 68 of CRPIT, p. 241–7. ACS, Ballarat, Australia.
62. Meurs M-J, Murphy C, Naderi N, et al. Towards evaluating the impact of semantic support for curating the fungus scientific literature. In: The 3rd Canadian Semantic Web Symposium (CSWS2011), Vancouver, British Columbia, 2011.
63. Meurs M-J, Murphy C, Morgenstern I, et al. Semantic text mining for lignocellulose research. In: Proceedings of the ACM Fifth International Workshop on Data and Text Mining in Biomedical Informatics, DTMBIO '11, 2011. p. 35–42. ACM, New York, NY.
64. Naderi N, Kappler T, Baker CJO, et al. OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents. Bioinformatics 2011;27:2721–9.
65. Bada M, Eckert M, Evans D, et al. Concept annotation in the CRAFT corpus. BMC Bioinformatics 2012;13:161.
66. Velupillai S, Dalianis H, Hassel M, et al. Developing a standard for de-identifying electronic patient records written in Swedish: precision, recall and F-measure in a manual and computerized annotation trial. Int J Med Inform 2009;78:e19–26.
67. Roberts A, Gaizauskas R, Hepple M, et al. The CLEF corpus: semantic annotation of clinical text. AMIA Annu Symp Proc 2007;625–9.
68. Dogan RI, Murray GC, Névéol A, et al. Understanding PubMed user search behavior through log analysis. Database 2009;2009:bap018.
69. Névéol A, Dogan RI, Lu Z. Semi-automatic semantic annotation of PubMed queries: a study on quality, efficiency, satisfaction. J Biomed Inform 2011;44:310–8.
70. Hunter L, Lu Z, Firby J, et al. OpenDMAP: an open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinformatics 2008;9:78.
71. Rubin D, Wang D, Chambers D, et al. Natural language processing for lines and devices in portable chest X-rays. AMIA Annu Symp Proc 2010;692–6.
72. Reeves RM, Ong FR, Matheny ME, et al. Detecting temporal expressions in medical narratives. Int J Med Inform 2012; doi:10.1016/j.ijmedinf.2012.04.006 (Advance Access publication 15 May 2012).
73. Boyce R, Gardner G, Harkema H. Using natural language processing to extract drug–drug interaction information from package inserts. In: BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing, 2012. p. 206–13. Association for Computational Linguistics, Montréal, Canada.
74. Tomanek K, Wermter J, Hahn U. Efficient annotation with the Jena ANnotation Environment (JANE). In: Proceedings of the Linguistic Annotation Workshop, LAW '07, 2007. p. 9–16. Association for Computational Linguistics, Stroudsburg, PA, USA.
75. Hahn U, Beisswanger E, Buyko E, et al. Semantic annotations for biology: a corpus development initiative at the Jena University Language & Information Engineering (JULIE) Lab. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), 2008. European Language Resources Association (ELRA), Marrakech, Morocco.
76. Salgado D, Krallinger M, Depaule M, et al. MyMiner: a web application for computer-assisted biocuration and text annotation. Bioinformatics 2012;28:2285–7.
77. Arighi C, Lu Z, Krallinger M, et al. BioCreative III interactive task: an overview. BMC Bioinformatics 2011;12:S4.
78. Krallinger M, Leitner F, Vazquez M, et al. How to link ontologies and protein–protein interactions to literature: text-mining approaches and the BioCreative experience. Database 2012;2012:bas017.
79. Chatr-aryamontri A, Winter A, Perfetto L, et al. Benchmarking of the 2010 BioCreative Challenge III text-mining competition by the BioGRID and MINT interaction databases. BMC Bioinformatics 2011;12:S8.
80. Song D, Chute CG, Tao C. Semantator: a semi-automatic semantic annotation tool for clinical narratives. In: 10th International Semantic Web Conference (ISWC 2011), 2011.
81. Tao C, Wongsuphasawat K, Clark K, et al. Towards event sequence representation, reasoning and visualization for EHR data. In: Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, IHI '12, 2012. p. 801–6. ACM, New York, NY, USA.
82. Chou W-C, Tsai RT-H, Su Y-S, et al. A semi-automatic method for annotating a biomedical proposition bank. In: LAC '06: Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora, 2006.
83. Thompson P, Iqbal SA, McNaught J, et al. Construction of an annotated corpus to support biomedical information extraction. BMC Bioinformatics 2009;10:349.
84. Sasaki Y, Thompson P, Cotter P, et al. Event frame extraction based on a gene regulation corpus. In: Proceedings of the 22nd International Conference on Computational Linguistics, Vol. 1, COLING '08, 2008. p. 761–8. Association for Computational Linguistics, Stroudsburg, PA, USA.
85. Klinger R, Furlong LI, Friedrich CM, et al. Identifying gene-specific variations in biomedical text. J Bioinform Comput Biol 2007;5:1277–96.
86. Kolářik C, Klinger R, Hofmann-Apitius M. Identification of histone modifications in biomedical text for supporting epigenomic research. BMC Bioinformatics 2009;10(Suppl 1):S28.
87. Thompson P, Cotter P, McNaught J, et al. Building a bio-event annotated corpus for the acquisition of semantic frames from biomedical corpora. In: Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), 2008. European Language Resources Association (ELRA), Marrakech, Morocco.
88. Thompson P, Nawaz R, McNaught J, et al. Enriching a biomedical event corpus with meta-knowledge annotation. BMC Bioinformatics 2011;12:393.
89. Oda K, Kim J-D, Ohta T, et al. New challenges for text mining: mapping between text and manually curated pathways. BMC Bioinformatics 2008;9:S5.
90. Batista-Navarro RT, Ananiadou S. Building a coreference-annotated corpus from the domain of biochemistry. In: Proceedings of the BioNLP 2011 Workshop, BioNLP '11, 2011. p. 83–91. Association for Computational Linguistics, Stroudsburg, PA, USA.
91. Kim J-D, Ohta T, Pyysalo S, et al. Extracting biomolecular events from literature: the BioNLP'09 Shared Task. Comput Intell 2011;27:513–40.
92. Cohen KB, Johnson H, Verspoor K, et al. The structural and content aspects of abstracts versus bodies of full text journal articles are different. BMC Bioinformatics 2010;11:492.
93. Kim J-D, Wang Y, Takagi T, et al. Overview of the GENIA event task in BioNLP Shared Task 2011. In: Proceedings of the BioNLP Shared Task 2011 Workshop, 2011. p. 7–15. Association for Computational Linguistics, Portland, OR, USA.