
Trends in biomedical text mining

Martin Krallinger, Spanish National Cancer Research Centre, (CNIO), Madrid, Spain

November 18th, 2013 Jornadas MAVIR

Martin Krallinger


• Background, introduction, biomedical annotations
• BioCreative: Critical Assessment of Information Extraction in Biology
• Recognition of chemical compound and drug mentions
• Text mining applications and the extraction of drug-induced adverse effects
• Text mining of pathways and enzymatic reactions
• Text mining and ranking of mutations and proteins for cancer
• Detection of organisms and species in text
• MyMiner tool for building biomedical text annotations


Unstructured Text (implicit knowledge)

Information Retrieval

Knowledge Discovery

Biomedical Databases

Structured database records, ontologies and controlled vocabularies

Structured content (explicit knowledge)

Advanced Information Retrieval

Named Entity Recognition Information extraction

Semantic metadata

Adapted from: Text Mining for Biomedicine: Techniques & tools, Sophia Ananiadou, Chikashi Nobata, Yutaka Sasaki, Yoshimasa Tsuruoka

Characteristics of biomedical language

• Heavy use of domain-specific terminology (12% biochemistry-related technical terms*), examples: chemoattractant, fibroblasts, angiogenesis
• Polysemic words (term homonymy), examples: APC: (1) Argon Plasma Coagulation, (2) Activated Protein C; or teashirt: (1) a type of cloth, (2) a gene name (tsh)
• Heavy use of acronyms, examples: activated protein C (APC), vascular endothelial growth factor (VEGF)
• Most words have a low frequency (data sparseness)

* Netzel R, Perez-Iratxeta C, Bork P, Andrade MA. The way we write. EMBO Rep. 2003 May;4(5):446-51
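The acronym pattern above ("long form (SHORT FORM)") can be illustrated with a minimal sketch. It is a simplified initial-letter heuristic written for this example, not the method of any particular tool; the function name and the length limits in the regular expression are assumptions.

```python
import re

def find_acronym_definitions(text):
    """Return (long_form, short_form) pairs for patterns like 'long form (SF)'.

    Simplified heuristic: the short form in brackets must equal the initial
    letters of the words that directly precede it.
    """
    pairs = []
    for match in re.finditer(r"\(([A-Z][A-Za-z0-9]{1,9})\)", text):
        short = match.group(1)
        words = text[:match.start()].split()
        candidate = words[-len(short):]          # as many words as letters
        initials = "".join(w[0] for w in candidate).upper()
        if initials == short.upper():
            pairs.append((" ".join(candidate), short))
    return pairs

example = "Activated protein C (APC) and vascular endothelial growth factor (VEGF)"
print(find_acronym_definitions(example))
```

Real systems have to cope with short forms that skip words or use inner letters, which this sketch deliberately ignores.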


Ambiguity and variability

• New names & terms are created ad hoc and they change, example: ‘This disorder maps to chromosome 7q11-21, and this locus was named CLAM.’ [PMID:12771259]
• Lexical & typographical variants (e.g. in writing gene names), example: TNF-alpha and TNF alpha (without hyphen)
• Term synonymy: different words with the same meaning
• Different writing styles (native languages): syntactic, semantic and word usage implications
• Heavy use of referring expressions (anaphora, cataphora and ellipsis) and inference, example: Glycogenin is a glycosyltransferase. It functions as the autocatalytic initiator for the synthesis of glycogen in eukaryotic organisms.


Biomedical text mining: historical view



Biomedical text mining applications

• Acronym & term extraction
• Interactions
• Function
• Disease
• Groups, lists
• Entity recognition
• Sequence
• Retrieval, classification

Krallinger M, Valencia A, Hirschman L. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol. 2008;9 Suppl 2:S8.
Krallinger M, Leitner F, Valencia A. Analysis of biological processes and diseases using text mining approaches. Bioinformatics Methods in Clinical Research. Methods in Molecular Biology 593, to appear.

Importance of scientific literature data

• Life sciences generate heterogeneous data types (articles, sequence, structure, ...)
• Natural language is used for communicating scientific discoveries
• Natural language texts are amenable to direct human interpretation
• Natural language appears not only in scientific articles, but also in patents, reports, newswire, database records, controlled vocabularies (GO terms), ...
• Functional information & annotations are directly or indirectly derived from the literature (curation and electronic annotation)
• Databases are generally only capable of covering a small fraction of the biological context information that can be encountered in the literature
• Contextual information of experimental results (cell line, tissue, conditions)
• User demand for better information access (beyond keyword searches)
• Rapid growth of information: manual information extraction is not efficient

Biomedical literature & scientific discovery

Biology
• Define the biological question
• Select the actual target being studied
• Extract information relevant for the experimental set-up
• Locate relevant resources
• Essential to understand and interpret the resulting data
• Draw conclusions about new discoveries
• Communicated to the scientific community using publications in peer-reviewed journals

Clinics
• Resource for clinical decision support in evidence-based clinical practice
• Useful information for diagnostic aids

Pharma
• Drug discovery and target selection
• Identifying adverse drug effects
• Competitive intelligence and knowledge management

Funding
• Global view of the current research state & monitoring of trends to ensure optimal resource allocation

Publ.
• Find domain experts for specific topics for the peer-review process & detect potential cases of plagiarism

Bio-databases annotations


Textual data in biological annotations


Controlled vocabularies and ontologies



Biocuration: manual literature annotations

Bio-entities

Scientific Literature

Controlled vocabularies

Database curator Annotation Databases


Ontologies: growing in content and terms



Increasing number of ontologies: > 130

Formats (OBO, OWL, XML, RDF) (http://www.obofoundry.org)

OLS: browsing ontologies

www.ebi.ac.uk/ontology-lookup


Biocuration: main tasks


Databases using/exploring text-mining help

Text mining for annotation databases: where and when

Krallinger, Martin. A Framework for BioCuration Workflows (part II). Available from Nature Precedings (2009)


Biocuration workflows

Provided by Andrew Winter BioGRID database (http://wiki.thebiogrid.org/doku.php/curation_description)


• Compare different methods and strategies
• Reproduce performance of systems on common data
• Provide useful data collections: Gold Standard data
• Explore meaningful evaluation strategies and tools
• Determine the state of the art
• Monitor improvements in the field
• Point out needs of the user community
• Promote collaborative efforts

Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A. Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol. 2008;9 Suppl 2:S1.


[Figure: community challenges arranged by field (BIOINFORMATICS, NLP/IR/IE, BIO-NLP related). Adapted from Krallinger et al. Genome Biology 2008 9(Suppl 2):S1]

Challenges shown: CASP, CAMDA, BioCreative, BioNLP Shared Task, TREC, DDI challenge, CAPRI, TREC Genomics track, GASP, LLL05 challenge, GAW, MUC, ACE, SEMEVAL, KDD cup 2002, SENSEVAL, i2b2 (medical), PTC, JNLPBA shared task, RTE

KDD: Knowledge Discovery and Data mining
CASP: Critical Assessment of Protein Structure Prediction
CAMDA: Critical Assessment of Microarray Data Analysis
CAPRI: Critical Assessment of Prediction of Interactions
GASP: Genome Annotation Assessment Project
GAW: Genome Access Workshop
PTC: Predictive Toxicology Challenge
JNLPBA: Joint workshop on Natural Language Processing in Biomedicine
TREC: Text Retrieval conference
MUC: Message Understanding conference
LLL05: Genic interaction extraction challenge
RTE: Textual Entailment challenge


Biomedical text mining: historical view

Critical Assessment of Information Extraction systems in Biology

• Community-wide effort, several organizers, different tasks
• Evaluation of text mining & information extraction applied to the biological and biomedical domain
• Increasing number of groups working in the area of text mining
• Need for common standards and shared evaluation criteria to enable comparison: avoid "one system = one evaluation data set"
• Promote development of real applications, tools
• Assessment of scientific progress: monitor improvements
• Involve domain experts (end users), biological database curators and text mining experts

Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A. Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol. 2008;9 Suppl 2:S1.




BioCreative I
• Gene mention recognition
• Gene normalization
• GO annotation

BioCreative II
• Gene mention recognition
• Gene normalization
• Protein interaction

BioCreative II.5
• Protein interaction

BioCreative III
• Protein interaction
• Gene normalization
• User interaction

Evaluation metrics (I)

• FN: false negatives, incorrect negative classification results (type II errors)
• FP: false positives, incorrect positive classification results (type I errors)
• TN: true negatives, correct negative classification results (correct rejection)
• TP: true positives, correct positive classification results (correct hit)
• AUC iP/R

Evaluation metrics

• Accuracy is the percentage of correctly labeled results over all results
• Matthews correlation coefficient (MCC) is a balanced measure of a test's results
• Area under the ROC curve (AUC ROC) can evaluate ranked (ordered) results
• Other metrics: precision/recall curves, average precision (AP), threshold average precision (TAP) and agreement measures
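As a worked illustration of the count-based metrics above, the following sketch computes accuracy and MCC from the four confusion-matrix counts using their standard textbook definitions. The function names and the example counts are made up for this illustration.

```python
import math

def accuracy(tp, fp, tn, fn):
    """Fraction of correctly labeled results over all results."""
    return (tp + tn) / (tp + fp + tn + fn)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Made-up counts, only to show the calls.
print(accuracy(tp=90, fp=10, tn=80, fn=20))
print(mcc(tp=90, fp=10, tn=80, fn=20))
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, which is why it is preferred on unbalanced sets.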


• Coordinated by the National Center for Biotechnology Information (NCBI)
• Inspired by the BioCreative IA task: detection of gene mentions
• Highest F-score*: 87.21
• 21 participating teams
• Many systems used:
  • Conditional Random Fields (CRF); Support Vector Machines (SVM)
  • POS tagging, stemming
• Exploited systems include: Mallet and the GENIA tagger
• Additional resources: Medline, HUGO and MedPost

* F-score: harmonic mean of precision and recall


• Coordinated by MITRE (Lynette Hirschman), similar to the BioCreative 1B task
• A total of 20 groups submitted predictions (up to three runs per team)
• Normalization: association of an entity to database records (sequence)
• Extract a list of gene identifiers of the genes mentioned in PubMed abstracts
• Highest F-score around 0.8
• Normalization reference database: human genes/proteins to EntrezGene
• Dictionary look-up
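The dictionary look-up approach to normalization can be sketched as follows. This is a toy illustration under stated assumptions: the lexicon entries and the `GENE:000x` identifiers are invented placeholders (not EntrezGene records), and the matching is a plain normalized substring search rather than any participating system's method.

```python
import re

# Hypothetical toy lexicon: surface form -> made-up identifier.
LEXICON = {
    "tumor necrosis factor alpha": "GENE:0001",
    "tnf alpha": "GENE:0001",
    "vascular endothelial growth factor": "GENE:0002",
    "vegf": "GENE:0002",
}

def normalize(text):
    """Lower-case and treat hyphens and runs of whitespace as one space."""
    return re.sub(r"[\s\-]+", " ", text.lower()).strip()

def gene_ids_in_abstract(text, lexicon=LEXICON):
    """Return the sorted list of unique identifiers whose names occur in text."""
    without_punctuation = re.sub(r"[^\w\s\-]", " ", text)
    padded = " " + normalize(without_punctuation) + " "
    found = set()
    for surface, gene_id in lexicon.items():
        if " " + surface + " " in padded:   # whole-word match
            found.add(gene_id)
    return sorted(found)

print(gene_ids_in_abstract("TNF-alpha and VEGF were up-regulated."))
```

The output is a list of identifiers per abstract, not mention offsets, which mirrors the task description above.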


Krallinger M, Leitner F, Vazquez M, Salgado D, Marcelle C, Tyers M, Valencia A, Chatr-aryamontri A. How to link ontologies and protein-protein interactions to literature: text-mining approaches and the BioCreative experience. Database (Oxford). 2012 Mar 21;2012:bas017

Protein-Protein Interaction (PPI)

• Specific physical contacts with molecular binding between proteins, both transient and stable
• PPI information sources: literature, large-scale experiments, bioinformatics predictions
• Public repositories integrate information from large- and small-scale PPI experiments reported in the scientific literature
• Pathguide contains information about 325 biological pathway and molecular interaction related resources (pathguide.org)
• Annotation effort shared by various interaction databases: BioGRID, MINT, BIND, CORUM, DIP, HAPPI, HPRD, I2D, InnateDB, IntAct, InteroPorc, iRefIndex, iRefWeb, MatrixDB, MIPS, PC, PIMRider
• Common vocabulary and standards to improve consistency and efficiency of PPI annotations: PSI-MI

IMS



ACT: Article categorization task

• Binary classification of recent PubMed abstracts as PPI relevant
• Predictions provided together with a confidence score in the ]0, 1] range
• Evaluation based on AUC iP/R (also additional analysis: F-score, accuracy)
• NOT a balanced set; abstracts, journals of biocuration interest
• Exhaustive manual revision by three domain experts and refinement based on database curators of BioGRID and MINT
• IAA: pairwise percentage agreement between MINT & BioGRID of 95%
• Article ID ➠ Class ➠ [Rank ➠] Confidence

Data set                       Total size   PPI    Not PPI   PPI proportion
TRAINING SET (balanced)        2280         1140   1140      50%
DEVELOPMENT SET (unbalanced)   4000         682    3318      17.05%
TEST SET (unbalanced)          6000         910    5090      15.17%
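The AUC iP/R measure used for ranked predictions can be sketched as below. This follows one common definition (precision at each relevant hit is replaced by the highest precision at any equal or higher recall, and the area is summed over recall steps); the official BioCreative evaluation library may differ in details, so treat it as an illustration. The function name and input layout are assumptions.

```python
def auc_ipr(ranked_labels):
    """Area under the interpolated precision/recall curve.

    ranked_labels: gold labels (1 = PPI relevant, 0 = not) of the articles,
    sorted by decreasing classifier confidence.
    """
    total_positives = sum(ranked_labels)
    if total_positives == 0:
        return 0.0
    points = []  # (recall, precision) measured at every relevant hit
    hits = 0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            points.append((hits / total_positives, hits / rank))
    area = 0.0
    previous_recall = 0.0
    for i, (recall, _) in enumerate(points):
        interpolated = max(p for _, p in points[i:])
        area += interpolated * (recall - previous_recall)
        previous_recall = recall
    return area

print(auc_ipr([1, 0, 1, 1, 0]))
```

Because only the ordering matters, two runs with the same labels but different confidence scales obtain the same value as long as their rankings agree.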

Example system (specialized): PIE

http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/PIE/


IMS OVERVIEW Articles

PSI-MI 2.5 Ontology

PMID:16317001

MI:0017

BioCreative II and III: Interaction Method results

[Figure: bar chart with one group of bars per interaction detection method for the Test, Development and Train sets; axis from 0 to 0.2.]

Methods shown: phosphatase assay, light scattering, filter binding, cosedimentation in solution, protease assay, peptide array, fluorescence technology, enzymatic study, cross-linking study, confocal microscopy, enzyme linked immunosorbent assay, far western blotting, cosedimentation through density gradient, affinity chromatography technology, isothermal titration calorimetry, competition binding, nuclear magnetic resonance, tandem affinity purification, fluorescent resonance energy transfer, molecular sieving, x-ray crystallography, protein kinase assay, surface plasmon resonance, two hybrid, anti bait coimmunoprecipitation, fluorescence microscopy, anti tag coimmunoprecipitation, pull down

Krallinger M, Leitner F, Rodriguez-Penagos C, Valencia A. Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biol. 2008;9 Suppl 2:S4.

Krallinger M, Leitner F, Vazquez M, Salgado D, Marcelle C, Tyers M, Valencia A, Chatr-aryamontri A. How to link ontologies and protein-protein interactions to literature: text-mining approaches and the BioCreative experience. Database (Oxford). 2012 Mar 21;2012:bas017


‘Difficult’ cases

['19481529', 'MI:0424', '1', '0.630389', 'phosphorylated Ser437Ala mutant , suggesting phosphorylation of PACS-2 Ser437 was required for binding 14-3-3 proteins . We then conducted a fluorescence polarization assay to determine quantitatively whether phosphorylated'] ['protein kinase assay']

['18922473', 'MI:0006', '2', '0.472072315860236', 'Interaction between the endogenous TRAF6 and TAK1 in AML12 cells as determined by immunoprecipitation with anti - TAK1 antibody , followed by anti - TRAF6 Western blot . The TGF - β treatment was for 30 minutes and the total rabbit IgG \n'] ['anti bait coimmunoprecipitation', 'anti bait coip']


BioCreative IV: CHEMDNER task



http://www.biocreative.org/tasks/biocreative-iv/chemdner/


Krallinger et al. Analysis of biological processes and diseases using text mining approaches. Methods in Molecular Biology 593, 341-382


• PubChem
• ChEBI
• CHEMBL
• Pathway DBs
• Molecular interaction DBs
• Structure databases
• Toxicology/pharmacogenomics DBs
• Biochemical/metabolic pathway DBs
• Many others, ...

• Aim: identify entity mentions in text
• Generic NER: corporate names and places (0.9 F-score), see Message Understanding Conferences (MUC)
• Biology/chemistry NER: more complex (synonyms, disambiguation, typographical variants, official symbols not used, ...)
• Methods: POS tagging, rule-based, flexible matching, statistics, ML (naïve Bayes, ME, SVM, CRF, HMM)
• Important for downstream text mining: essential step for finding bioentity-relevant documents & for automatic extraction of relationships using information extraction (IE)

• Genes and proteins
• DNA
• RNA
• Cell lines
• Cell types
• Chemical compounds
• Mutations, sequence variations, sequences
• Species and organisms
• Anatomical terms
• Disease terms, ...

• Authors often do not follow official IUPAC nomenclature guidelines
• Chemical compounds/drugs often have many synonyms or aliases (e.g. systematic names, trivial names and abbreviations referring to the same entity)
• Existence of hybrid mentions (e.g. mentions that are partially systematic and partially trivial: semi-systematic)
• Chemical compounds are ambiguous with respect to other entities like gene/protein names or common English words
• Alternative typographical variants: hyphens, brackets, and spacing
• Alternative word order variants
• The ambiguity of chemical acronyms, short formulae and trivial names
• Identifying new chemical compound names (novel entities)
• A set of specialized word tokenization rules is required for chemical terms
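One way to handle the typographical variants listed above (hyphens, brackets, spacing) is to map each mention to a normalized matching key before dictionary look-up. The sketch below is an illustrative assumption, not a rule set from the CHEMDNER task; the function name is invented, and such aggressive collapsing can merge names that are chemically distinct.

```python
import re

def variant_key(mention):
    """Collapse case, hyphen, bracket and spacing variants into one key."""
    key = mention.lower()
    key = re.sub("[-\u2010-\u2015]", " ", key)   # ASCII and Unicode hyphens/dashes
    key = re.sub(r"[()\[\]{}]", " ", key)        # all bracket types
    return re.sub(r"\s+", "", key)               # drop every remaining space

for name in ("2-Acetoxybenzoic acid", "2-acetoxy benzoic acid", "2-(Acetoxy)benzoic acid"):
    print(name, "->", variant_key(name))
```

Keys are meant for matching only; the original surface form and its character offsets should be kept for reporting mentions.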


Main strategy types


[CDI] Chemical document indexing sub-task: given a set of documents, return for each of them a ranked list of the unique chemical entities described within it.
Prediction fields: 1) PMID 2) Mention (unique) 3) Rank 4) Confidence score

[CEM] Chemical entity mention recognition sub-task: for a given document, provide the start and end character indices of all the chemical entities mentioned in it.
Prediction fields: 1) PMID 2) Mention offset (T: title / A: abstract, start offset, end offset) 3) Rank 4) Confidence score

Up to five different automatic annotations (runs) for each sub-task
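To make the relation between the two prediction formats concrete, the sketch below turns mention-level (CEM-style) predictions into a ranked list of unique mentions (CDI-style). It assumes offsets written as `A:start:end` or `T:start:end` with a 0-based start and an exclusive end; the function names, the placeholder PMID and the example text are invented for illustration, and the official file layout is not reproduced here.

```python
def parse_offset(field):
    """Split 'A:12:16' into ('A', 12, 16)."""
    section, start, end = field.split(":")
    return section, int(start), int(end)

def mention_text(title, abstract, field):
    """Return the text span an offset field points to."""
    section, start, end = parse_offset(field)
    source = title if section == "T" else abstract
    return source[start:end]

def cem_to_cdi(pmid, title, abstract, cem_predictions):
    """cem_predictions: list of (offset_field, confidence) pairs.

    Returns (pmid, mention, rank, confidence) tuples with one row per unique
    mention string, ranked by the best confidence seen for that string.
    """
    best = {}
    for field, confidence in cem_predictions:
        text = mention_text(title, abstract, field)
        best[text] = max(confidence, best.get(text, 0.0))
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [(pmid, text, rank, conf) for rank, (text, conf) in enumerate(ranked, start=1)]

title = "Squaraine Dye synthesis"
abstract = "Aspirin and DMSO were tested. Aspirin worked."
predictions = [("A:0:7", 0.9), ("A:12:16", 0.8), ("A:30:37", 0.95), ("T:0:13", 0.5)]
for row in cem_to_cdi("00000000", title, abstract, predictions):
    print(row)
```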


• Main metric: micro-averaged F-score
• Automated predictions compared against manual annotations (Gold Standard)
• Exact match evaluation
• FN: false negatives, incorrect negative classification results (type II errors)
• FP: false positives, incorrect positive classification results (type I errors)
• TN: true negatives, correct negative classification results (correct rejection)
• TP: true positives, correct positive classification results (correct hit)
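A micro-averaged, exact-match evaluation can be sketched as follows: true positives, false positives and false negatives are pooled over all documents before precision, recall and F-score are computed. The data layout (a dictionary from document id to a set of `(section, start, end)` tuples) and the document ids are assumptions made for this example, not the format of the official evaluation library.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F-score with exact span matching.

    gold, predicted: dict mapping document id -> set of (section, start, end).
    """
    tp = fp = fn = 0
    for doc_id in set(gold) | set(predicted):
        gold_spans = gold.get(doc_id, set())
        pred_spans = predicted.get(doc_id, set())
        tp += len(gold_spans & pred_spans)   # identical offsets only
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = {"doc1": {("A", 0, 7), ("A", 12, 16)}, "doc2": {("T", 0, 13)}}
predicted = {"doc1": {("A", 0, 7), ("A", 12, 15)}, "doc2": set()}
print(micro_prf(gold, predicted))
```

In the example, the prediction `("A", 12, 15)` overlaps a gold span but counts as both a false positive and a false negative, which is what exact matching implies.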


(1) Corpus selection and sampling
(2) Annotation guidelines and their corpus-driven refinements
(3) Entity annotation granularity
(4) Human annotator expertise and training
(5) Annotation tools and interface
(6) Annotation consistency, definition of upper & lower performance boundaries to be expected by tools
(7) Corpus format and availability


SYSTEMATIC
Systematic names of chemical mentions, e.g. IUPAC and IUPAC-like names.
Examples: 2-Acetoxybenzoic acid; N-(4-hydroxyphenyl)acetamide; 3,5,4'-trihydroxy-trans-stilbene

IDENTIFIERS
Database ids of chemicals: CAS numbers, Company Registry numbers, PubChem, ChEBI, CHEMBL identifiers.
Examples: 501-36-0; 445154; CHEMBL 504

FORMULA
Molecular formula, SMILES, InChI, InChIKey.
Examples: CC(=O)Oc1ccccc1C(=O)O; InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12); C9H8O4

TRIVIAL
Trivial, trade (brand), common/generic names of compounds (also International Nonproprietary Name, British Approved Name & United States Adopted Name).
Examples: Aspirin; Acylpyrin; paracetamol; acetaminophen; Tylenol

ABBREVIATION
Abbreviations & acronyms of chemicals and compounds.
Examples: DMSO; GABA

FAMILY
Chemical families that can be linked to chemical structure: plurals of systematic IUPAC names, general formulas, etc.
Examples: Iodopyridazines; diphenols; quinolines; terpenoids; ROH

MULTIPLE
Chemicals that are not a continuous string of characters (e.g. multiple chemicals disrupted by coordinated clauses).
Example: thieno[2,3-d] and thieno[3,2-d] fused oxazin-4-ones

• Define what constitutes a chemical entity mention (CEM)
  • Mentions of practical relevance based on potential target applications
  • Mentions that can be linked to chemical structure information
• Review public/available annotation guidelines for chemicals: annotation manual by Corbett et al as initial reference
• CHEMDNER corpus main modifications:
  • Only chemical nouns (and specific adjectives, treated as nouns) are tagged (not reactions, prefixes or enzymes)
  • Reduction in the number of rules
  • Rules were grouped as Positive, Negative, Orthography, MultiWords
  • Multiword rules were greatly simplified: annotation may be less precise but is less error-prone for human annotators
  • Additional rules for assignment to CEM classes (similar to Klinger et al)
• Iterative process → guidelines were slightly refined after the first sample test annotation
• Required refinements detected during training/test set annotations will be incorporated in the future release of the CHEMDNER Corpus plus

- User-friendly web-based curation tool
- List of available CEM types
- Auto-completion tool
- Possibility to add comments
- Some post-processing steps done with MyMiner and other scripts

[Figure: number of annotations per CEM class (ABBREVIATION, FAMILY, FORMULA, IDENTIFIER, MULTIPLE, SYSTEMATIC, TRIVIAL, NO CLASS) in the TRAIN, DEVELOPMENT and TEST sets; axis from 0 to 10000.]

• Estimate of the lower boundary of the expected recall
• Vocabulary transfer, defined as the proportion of entities (without repetition) that appear both in the training/development set and in the test corpus: 36.34% (train & dev), 27.77% (train), 27.70% (dev)
• Dictionary baseline using the train/dev vocabulary: precision 57.22%, recall 57.00%, F-score 57.11%
• F-scores of existing tools (out of the box): ChemicalTagger (27.46), OSCAR4 (30.54), Jochem (56.40), MeSH term annotations (11.01)
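The two baseline figures above can be reproduced in spirit with a short sketch. The vocabulary transfer function assumes the proportion is taken over the unique test-set entities (the slide does not state the denominator explicitly), and the mention lists are invented; the F-score function is the usual harmonic mean and is checked against the reported dictionary baseline.

```python
def vocabulary_transfer(train_mentions, test_mentions):
    """Share of unique test mentions that also occur in the training data.

    Assumption: the denominator is the unique test vocabulary.
    """
    train_vocab = set(train_mentions)
    test_vocab = set(test_mentions)
    return len(train_vocab & test_vocab) / len(test_vocab)

def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Invented toy vocabularies, only to show the call.
print(vocabulary_transfer(["aspirin", "DMSO", "aspirin"], ["aspirin", "GABA", "ROH", "DMSO"]))

# Dictionary baseline reported above: precision 57.22, recall 57.00.
print(round(f_score(57.22, 57.00), 2))
```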

• The inter-annotator agreement (IAA) study constitutes a sort of upper boundary for the expected system performance
• Agreement between humans manually labeling the data
• Useful for estimating the difficulty of the task and the quality of annotations
• Inter-annotator agreement based on double annotation of 100 abstracts randomly chosen from the entire dataset
• 91% for matching of the manual annotations (regardless of CEM class) and 85.26% for exact matching with the same CEM class
• Most discrepancies: annotations that were missed by either one of the annotators (FN cases)


• Test set annotation using an additional curation team to recover potential FN cases
• The main annotator team carried out the labeling of the entire CHEMDNER corpus; a secondary team of curators additionally annotated only the test set
• Relied primarily on the annotations of the main annotator team: higher degree of training & provided active feedback for refinement of annotation guidelines
• Conflicting annotations between the two teams were presented to the main curation group for manual revision
• 1,185 annotations added to the original 24,671 test set annotations (4.08 %), 505 (2.05%) removed: final test set of 25,351 annotations

Team id | Team leader | Members | Type | Institution
173 | Zhiyong Lu | 3 | Academic | NCBI/NLM/NIH
177 | Tolga Can | 2 | Academic | Middle East Technical University, Ankara, Turkey
179 | Daniel Lowe | 2 | Commercial | NextMove Software
182 | Alexander Klenner | 2 | Academic | Fraunhofer-Institute for Algorithms & Sci.Comp.
184 | Rafal Rak | 2 | Academic | National Centre for Text Mining, University of Manchester
185 | S V Ramanan | 2 | Commercial | RelAgent Pvt Ltd
191 | Ana Usié Chimenos | 5 | Academic | Universitat de Lleida
192 | Hua Xu | 5 | Academic | The University of Texas Health Science Center at Houston (UTHealth)
196 | Francisco Couto | 3 | Academic | LASIGE, University of Lisbon
197 | Sérgio Matos | 3 | Academic | University of Aveiro
198 | Philippe Thomas | 5 | Academic | Humboldt-Universität zu Berlin
199 | Matthias Irmer | 6 | Commercial | OntoChem
207 | Karin Verspoor | 3 | Academic | NICTA (National ICT Australia)
214 | Daniel Bonniot de Ruisselet | 1 | Commercial | ChemAxon
217 | Li LiShuang | 6 | Academic | DaLian University of Technology
219 | Madian Khabsa | 1 | Academic | The Pennsylvania State University
222 | Saber Ahmad Akhondi | 6 | Academic | Erasmus MC, Rotterdam, The Netherlands
225 | Daniel Sanchez-Cisneros | 4 | Academic | Universidad Carlos III & Univ. Autónoma Madrid
231 | Donghong Ji | 4 | Academic | Wuhan University
233 | Tsendsuren Munkhdalai | 1 | Academic | Chungbuk National University, South Korea
238 | Hongfang Liu | 5 | Academic | Mayo Clinic
245 | Slavko Zitnik | 1 | Academic | University of Ljubljana
259 | Shuo Xu | 4 | Academic | Institute of Scientific & Technical Information of China
262 | Asif Ekbal | 3 | Academic | IIT Patna, India
263 | Masaharu Yoshioka | 2 | Academic | Hokkaido University, Sapporo, Japan
265 | shu ching-yao | 1 | Academic | Yuan Ze University
267 | LiLiShuang | 5 | Academic | DaLian University of Technology

[Figure: F-score (axis from 0 to 1) of the submitted CDI runs. Runs in plotted order: team_231_cdi_run3, team_184_cdi_run4, team_198_cdi_run4, team_231_cdi_run2, team_198_cdi_run5, team_173_cdi_run4, team_231_cdi_run5, team_173_cdi_run3, team_179_cdi_run2, team_198_cdi_run3, team_233_cdi_run5, team_233_cdi_run1, team_179_cdi_run3, team_197_cdi_run5, team_173_cdi_run5, team_231_cdi_run4, team_197_cdi_run4, team_185_cdi_run2, team_233_cdi_run2, team_245_cdi_run3, team_199_cdi_run1, team_214_cdi_run2, team_222_cdi_run1, Team_207_cdi_run2, team_214_cdi_run4, team_217_cdi_run1, team_217_cdi_run3, team_214_cdi_run1, team_219_cdi_run1, team_265_cdi_run1, team_238_cdi_run3, team_219_cdi_run3, team_219_cdi_run5, Team_207_cdi_run1, team_177_cdi_run2, team_191_cdi_run1, team_238_cdi_run4, team_182_cdi_run1, team_225_cdi_run1, team_225_cdi_run4, team_182_cdi_run3, team_225_cdi_run5, team_219_cdi_run2, team_196_cdi_run4, team_182_cdi_run2, team_196_cdi_run1]

• 23 teams (91 runs)
• Highest F-score: 88.20
• Highest precision: 98.66 (recall of 16.65)
• Highest recall: 92.24 (precision of 76.29)

CHEMDNER

• P: precision
• R: recall
• F1: F-score
• SDs: standard deviation in the bootstrap samples


[Figure: results with the IAA and Baseline levels marked.]

• 26 teams (106 runs)
• Highest F-score: 87.39
• Highest precision: 98.05 (recall of 17.90)
• Highest recall: 92.11 (precision of 76.72)


Abbreviations        Family
Run      Recall      Run      Recall
T173_5   91.38       T173_5   90.06
T173_3   91.33       T219_2   89.56
T219_2   91.33       T173_3   87.99
T173_4   88.99       T231_5   87.63
T173_1   87.68       T231_3   86.55


Formula              Identifier
Run      Recall      Run      Recall
T173_3   89.37       T173_3   90.06
T173_5   88.09       T173_4   89.86
T173_2   83.94       T173_5   87.52
T231_5   83.82       T231_2   86.16
T173_4   83.18       T219_2   84.79


Systematic           Trivial
Run      Recall      Run      Recall
T182_2   94.25       T173_3   95.89
T173_5   94.15       T173_5   95.25
T173_3   94.03       T219_2   94.72
T173_4   92.74       T231_5   93.10
T179_2   92.24       T173_4   92.76


Multiple
Run      Recall
T173_5   60.30
T231_3   53.27
T173_3   53.27
T231_2   53.27
T231_5   52.76

No class (only 41 cases)


Run      Recall
T173_3   83.49
T173_5   82.43
T219_2   82.05
T173_4   78.52
T173_1   74.69

• Only 108 of 25,351 mentions were not detected by any team (joint recall = 99.57%); this holds for all mentions, not only novel ones
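The joint recall figure can be checked with a small sketch: a gold mention counts as detected if at least one team predicted it. The function name and the toy mention identifiers are assumptions for illustration; the last lines recompute the reported percentage from the two counts given above.

```python
def joint_recall(gold, team_predictions):
    """Recall of the union of all teams' predictions against the gold set."""
    detected = set().union(*team_predictions) & gold
    return len(detected) / len(gold)

# Toy example with invented mention ids.
gold = {"m1", "m2", "m3", "m4"}
teams = [{"m1", "m2"}, {"m2", "m3"}, {"m9"}]
print(joint_recall(gold, teams))

# Counts from the slide: 108 of 25,351 mentions missed by every team.
total, missed = 25351, 108
print(round(100 * (total - missed) / total, 2))
```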

[Figure: results by CEM class, classes numbered 1 to 8.]

1- Abbreviations 2- Family 3- Formula 4- Identifier 5- Multiple 6- No class 7- Systematic 8- Trivial

FORMULA: highly ambiguous one/two letter mentions
23537166   A:953:954   I
23591845   A:760:761   P
23560542   A:865:866   H
23375796   A:567:568   O

MULTIPLE: in general more difficult
23414837   A:152:199     triazolo and imidazo dihydropyrazolopyrimidines
23414837   A:24:74       amido and benzimidazole dihydropyrazolopyrimidines
23375209   A:1509:1538   C4/Ci4, C3, C5 acylcarnitines

TRIVIAL: dyes, special morphology names with brackets
23061466   A:875:889   Guangfu base A
23223414   A:497:510   anatase (101)
23580394   T:28:41     Squaraine Dye
23122138   A:274:286   Sepharose 4B

SYSTEMATIC: very few missed cases, difficult very long mentions
2R,4R- isomer of 2-hydroxy-2-(indol-3-ylmethyl)-4-aminoglutaric acid
1-piperazineacetamide, N-[5-(aminocarbonyl) tricyclo[3.3.1.13,7]dec-2-yl]-α,α-dimethyl-4-[5(trifluoromethyl)-2-pyridinyl]

• 18 teams had worked on chemical entity recognition before, 9 teams were new
• All teams used the provided training/development data sets
• All except 4 teams used the BioCreative evaluation library
• Only 5 teams used some other additional training data (RSC and legacy dictionaries, DDI corpus, ChEBI patent corpus)
• Lexical resources: used by all except 4 teams (including PubChem, an English dictionary as a negative lexicon, Jochem, ChEBI, DrugBank, CTD, UMLS, MeSH, Wikipedia, ChemSpider, HMDB, GPoSTTL)
• Lexical resource expansion (for synonyms/aliases): 10 teams
• Recognition of other entities: 9 teams (mostly genes or proteins, or generic named entities)

• Integration of previous chemical NER systems: 15 teams
• Taggers used: ChemSpot, Oscar4, LeadMine, ProMiner, Peregrine, CheNER, ICE, OPSIN, ChemAxon, MetaMap, MiniChem/Drug Tagger, PubTator, ChemicalTagger
• 21 teams used some sort of machine learning algorithm: Conditional Random Fields (19), Support Vector Machines, Brown clustering, word embedding induction, Logistic Regression, Max. Entropy, Random Forests
• 5 teams already provide software, 11 teams would be able to provide software, 10 stated that they would maybe be able to provide it
• 23 would participate again, 3 would maybe participate again

Using the framework offered by the CHEMDNER task as a way to:
• Improve/demonstrate the performance of their software
• Push the state of the art for this topic
• Test their tools on a different corpus
• Adapt their system to detect (new) types of chemical entities as defined in the CHEMDNER corpus
• Be able to compare their system to other strategies
• Improve the scalability of their system
• Explore the integration of different NER systems for this task

• First time this task was posed: considerable participation
• A comprehensive set of annotation rules was developed for this task
• A corpus large enough to enable training and testing of systems was constructed
• Obtained results are quite competitive but could still be improved slightly: combined systems & examination of individual methods
• Abstracts are a valuable source of chemical information to be exploited
• Chemical document indexing is not much better than chemical entity mention recognition
• A considerable number of systems are or will be available

Text mining applications and the extraction of drug-induced adverse effects


• Liver: central organ in toxicology, role in metabolic, excretory & synthetic biochemical pathways
• Direct/indirect toxicity, hypersensitivity, idiosyncratic reactions
• Challenging to predict using standard chemoinformatics
• Drug approval: one of the major causes of drug attrition
• Information in scientific literature, industry reports, institutional reports (EPAR), NDA, patents, clinical records, etc.
• Fourches et al 2010 Chem Res Toxicol: Drug Induced Liver Injuries, extracted and curated proto-assertions composed of concept_relationship_concept triples (BioWisdom Sophia)


• Complement the information extracted from companies' legacy records with data extracted from public documents
• Design novel methodologies to extract toxicology information from documents
• Develop a system to automate the extraction process
• Link the extracted data with other existing resources
• Prepare the extracted data for storage in the ChOX database

[Diagram labels: Public datasets, EPAR, NDA, Literature, CROs, Legacy reports, Ontologies, Vitic, ChOX, Directly imported, Text mining]


Data Source         Number of documents
PubMed abstracts    Entire collection
Full texts          13,234
EPAR                2,145
NDA                 7,738



                     Jochem            Oscar4       ChemicalTagger
Sentences            1,261,644         1,261,644    1,261,644
Mentions             720,387           2,002,884    1,414,761
Unique               24,067            143,066      143,656
Filtered             22,692            138,267      139,171
SMILES (SD files)    3,361 (14.81%)    -            -



• Three types of compound/drug indexing:
  (a) Dictionary-based (fast look-up, Jochem lexicon)
  (b) Named entity recognition (NER, based on ML: ChemSpot, OSCAR4)
  (c) Meta-data (manually indexed MeSH substance terms)
• Additional dictionary pruning
• Implementation of ChemNER (CRF-based, trained on the Kolarik corpus)
• Issues with name to structure conversion: ACD, CambridgeSoft (PerkinElmer) and LexiChem (OpenEye)
• CambridgeSoft: Name2Structure batch mode (evaluated several options)
• Comment: the text mining community demands a kind of community competition for chemical NER (sort of CHEMDNER)


- If a molecule has several fragments, the major one is kept
- Standardization of chirality and charge
- Remove those with incorrect isotopic information
- Acids are protonated
- Bases are deprotonated


Collab. with Obdulia Rabal (Universidad de Navarra)




                     Abstracts          Full text         EPAR
Sentences            116,352            63,391            2,230
Mentions             163,223            129,122           7,525
Ment_Look            146,224            61,822            3,369
Ment_Rule            16,999             58,612            4,156
Unique               1,997              1,459             125
Sent_CC              36,145             14,013            1,188
Ment_CC              56,801             18,432            2,443
Unique_CC            5,420              1,982             335
Name to structure    25,179 (44.33%)    8,211 (44.55%)    507 (20.75%)

Name to structure conversion: canonical SMILES (& SDF)

CHID = ChemIDplus, CHEB = ChEBI, CAS = CAS number, PUBC = PubChem compound, INCH = InChI string, DRUG = DrugBank

Inhibition    58,560    73 trigger
Substrate     87,094    378 trigger
Inducer       72,169    82 trigger

CYPs: to UniProt accession & CYPs nomenclature codes

Overall combined (lookup, rule, all sets): 299,869



- Lexicon of 243,667 mention variants/aliases (disambiguation of ambiguous acronym mentions)



                     Abstracts          Full text         EPAR
Sentences            430,876            65,190            2,671
Mentions             592,588            780,630           4,285
Ment_Look            -                  -                 -
Ment_Rule            -                  -                 -
Unique               644                280               93
Sent_CC              99,220             13,470            754
Ment_CC              138,404            18,034            1,059
Unique_CC            13,648             3,193             327
Name to structure    57,456 (41.51%)    5,699 (31.60%)    103 (9.73%)

Name to structure conversion: canonical SMILES (& SDF)

CHID = ChemIDplus, CHEB = ChEBI, CAS = CAS number, PUBC = PubChem compound, INCH = InChI string, DRUG = DrugBank

Increase    198,572
Decrease    163,859
Change      65,237

Combined filtered: 674,936



General aspects of Tebacten
• Identification of articles relevant to bacterial metabolism
• Detection of the bio-entities involved in biochemical reactions: enzymes, compounds and organisms/species
• Extraction of weighted (ranked) relationships between these bio-entities
• An interface to browse this information
• Construction of a manually curated database of metabolic reactions from the literature
• The option to normalize/ground bio-entity mentions to other knowledge bases like UniProt and ChEBI


Main relation types


Structuring the Biodegradation bibliome using taxonomies



Recognition of bacterial gene and protein mentions
Integrated:
- Machine learning NER
- Dictionary lookup (UniProt)
- EC code (pattern)
- Gene symbol (rule-based)
Should also add metadata from PubMed: enzymes
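The pattern and rule components listed above could look roughly like the following. This is a minimal sketch, not the Tebacten implementation: the regular expressions, the function names and the three-lowercase-plus-one-capital gene symbol convention are assumptions made for illustration.

```python
import re

# EC numbers have four dot-separated fields; trailing fields may be "-"
# when the classification is only partially specified (e.g. "EC 1.14.-.-").
EC_PATTERN = re.compile(r"\bEC\s*(\d+)\.(\d+|-)\.(\d+|-)\.(\d+|-)(?!\d)")

# Assumed rule for bacterial gene symbols: three lowercase letters followed
# by one capital letter (e.g. "xylE"). Real systems need many more rules.
GENE_SYMBOL_PATTERN = re.compile(r"\b[a-z]{3}[A-Z]\b")


def find_ec_codes(text):
    """Return EC codes found in text, normalized to the form 'EC a.b.c.d'."""
    return ["EC " + ".".join(m.groups()) for m in EC_PATTERN.finditer(text)]


def find_gene_symbols(text):
    """Return candidate bacterial gene symbols found in text."""
    return GENE_SYMBOL_PATTERN.findall(text)


if __name__ == "__main__":
    sentence = ("Catechol 2,3-dioxygenase (EC 1.13.11.2), encoded by xylE, "
                "and an oxidoreductase (EC 1.14.-.-).")
    print(find_ec_codes(sentence))
    print(find_gene_symbols(sentence))
```

Hits from such patterns would normally be merged with the machine learning and dictionary outputs before normalization.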


Extraction patterns



Extraction rules



Pseudomonas putida KT2440 bibliome



http://tebacten.bioinfo.cnio.es/


Construction of literature-derived mutation sets



A. Valencia, Florian Leitner, Miguel Vazquez
Spanish National Cancer Research Center, Madrid, Spain

Kamalakannan A. R. & Ashish V Tendulkar
Department of Computer Science and Engineering, Indian Institute of Technology, Madras, India


Motivation behind species tagging / importance
• Species-specific document classification/retrieval
• Essential for correct gene/protein mention normalization (linking bio-entity mentions to database identifiers)
• Biodiversity informatics: create a comprehensive catalogue of all described species together with literature pointers
• Associations of pathogens (viruses & bacteria) with cancer:
  - Cervical cancer & human papilloma virus (HPV)
  - Primary liver cancer & Hepatitis B and C viruses
  - Lymphomas & Epstein-Barr virus
  - T cell leukaemia in adults & the human T cell leukaemia virus
  - HPV and oropharyngeal cancer & non-melanoma skin cancers
  - Helicobacter pylori and stomach cancer
Harald zur Hausen (Nobel 2008, HPV & cervical cancer)


Text mining for pathogen – cancer type prioritization



BIOINFORMATICS
Vol. 00 no. 00 2011, Pages ???

ORNATA: An Organism Name Tagger Based on Conditional Random Fields
Kamalakannan A. R. (1), Martin Krallinger (2), Ashish V Tendulkar (1,∗)
(1) Department of Computer Science and Engineering, Indian Institute of Technology, Madras, India
(2) Spanish National Cancer Research Center, Madrid, Spain

Received on XXXXX; revised on XXXXX; accepted on XXXXX

Associate Editor: XXXXXXX

ABSTRACT
Motivation: There has been an increasing interest in building accurate systems for the task of identifying organism names in biomedical literature. It has been acknowledged that this is essential for advanced text mining tasks, like recognition and disambiguation of biomedical entities such as proteins and genes, and IE systems for extraction of protein-protein interactions. Organism name recognition is directly necessary for species-specific document classification and retrieval. Also, it is of great importance in the emerging field of biodiversity informatics, which aims to create comprehensive databases of all organisms.

Results: This work explores a method based on Conditional Random Fields (CRFs) using rich feature sets for organism name recognition. We conducted various experiments for selecting feature sets and labelling schemes appropriate for this particular task. We achieved F1 values of 74.60% on the GENIA corpus and 74.37% on the LINNAEUS corpus. We compared its performance with the LINNAEUS species name identification system and concluded that our method achieves performance competitive with the state of the art at a fraction of the computational cost. We also present a detailed analysis of the outputs from both of these systems.

Availability: The organism name tagger is freely available at http://www.cse.iitm.ac.in/~ashishvt/research/ornata/ and supported in Linux. It comes with four models trained on two standard corpora. The GUI version can be used for visualization of the tagging outputs, training customized models and analyzing testing results.

Contact: [email protected]

1 INTRODUCTION
The exponential increase in the volume of available biomedical literature has spurred a lot of interest in developing efficient techniques for biomedical text mining. Information Retrieval (IR) tools help biologists by retrieving relevant articles in response to search queries. Information Extraction (IE) methods can automatically extract information like protein-protein interactions and biomolecular events (??). These greatly reduce the manual effort in curation tasks, such as building databases of proteins, genes, diseases, drugs and biological pathways (????). A basic step in both IE and advanced IR systems is to identify biological entities like proteins, genes and cell lines from text, a task known as Named Entity Recognition (NER) or simply Entity Recognition (ER). In spite of a lot of attention, NER techniques perform far worse on biomedical texts than on general texts, like news articles, due to a variety of well-known factors (?).

In this work, we specifically focus on NER of organism names from research articles. It is an important task in the text mining pipeline for numerous reasons. Organism NER can help in the recognition and disambiguation of other biomedical entities, like genes and proteins, since they are typically mentioned along with their host species (??). Also, many teams which participated in the BioCreative II challenge for extraction of protein-protein interactions (PPI) and the BioNLP shared task for biomolecular event extraction emphasized the importance of recognizing species names (??).

Organism name recognition can also help in “taxonomically intelligent” IR to limit searches to the specified taxa (or even “explode” them to include all subordinate taxa). For example, if the word “mammal” is present in the search phrase, articles that mention any mammal name like “human” or “mice” can also be retrieved. PathBinderH is a tool that implements this for plant taxa (?). A similar project is uBioRSS, which aims to provide RSS feeds narrowed to the researcher’s taxa of choice (?). In the popular PubMed search, this can be achieved to some extent with the help of a controlled vocabulary called Medical Subject Headings (MeSH). MeSH is a carefully selected list of 25,588 descriptors (including organism names), arranged in a hierarchy, used to manually (or semi-automatically) tag biomedical articles. Entry terms or synonyms (172,000 in total) are also used and searches can be optionally exploded to include subcategories [1]. But the scope is limited since the set of organism names in MeSH is comparatively small, perhaps because of curation overheads.

Biodiversity informatics aims to automatically build databases containing information on the morphology, distribution and phylogeny of organisms and higher order taxa, to aid in studying the evolution, speciation and diversification of organisms across the world (?). Here, the processing of recently opened up historical archives poses many challenges, as they describe organisms which may have been renamed, migrated or become extinct. The Encyclopedia Of Life (EOL) project aims to create an authoritative webpage for each species known to mankind (around 2 million). The indispensability of automatic organism name identification from text in such ambitious efforts is obvious. Also, it will be useful in systems like BioLit, which are aimed at semantic enrichment of articles by marking up entities in them for integration with standard databases (??).

∗ To whom correspondence should be addressed.
[1] http://www.nlm.nih.gov/bsd/disted/pubmedtutorial/
© Oxford University Press 2011.
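The “taxonomically intelligent” retrieval idea from the introduction above (exploding a query taxon to all subordinate taxa) can be sketched as follows. The toy taxonomy, the function names and the whitespace tokenization are assumptions for illustration only; they are not how PubMed/MeSH, PathBinderH or uBioRSS work internally.

```python
# Toy child -> parent taxonomy (illustrative names, not MeSH or NCBI Taxonomy).
PARENT = {
    "human": "primate",
    "primate": "mammal",
    "mouse": "rodent",
    "rodent": "mammal",
    "mammal": "vertebrate",
    "zebrafish": "fish",
    "fish": "vertebrate",
}


def explode(term, parent=PARENT):
    """Return the term together with all of its subordinate taxa."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    result, stack = set(), [term]
    while stack:
        node = stack.pop()
        if node not in result:
            result.add(node)
            stack.extend(children.get(node, []))
    return result


def search(query_taxon, documents):
    """Return ids of documents mentioning the query taxon or any subordinate taxon."""
    terms = explode(query_taxon)
    return [doc_id for doc_id, text in documents.items()
            if terms & set(text.lower().split())]


if __name__ == "__main__":
    docs = {
        "d1": "Expression in mouse liver",
        "d2": "Zebrafish fin regeneration",
        "d3": "Human T cell activation",
    }
    print(search("mammal", docs))
```

A real system would match recognized organism mentions (including synonyms and plural forms such as “mice”) rather than raw tokens, which is where organism NER comes in.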


Challenges

- Large collection of species, changing over time, with hierarchical relation types
- Homonymy with commonly used English words, e.g. “Spot” (Leiostomus xanthurus) and “Permit” (Trachinotus falcatus)
- Homonymy with other biomedical entities too (the word “goat” can refer to proteins found in human, zebrafish, rat and mouse)
- Abbreviations are ambiguous, e.g. HBV can be used for both “Hepatitis B virus” and “Hepatitis B vaccine”
- Vernacular forms (common names)
- Incorrect case or misspellings (like Bacterium coli, Bacillus coli and Escheria coli for Escherichia coli)
- Coordinations and nested expressions: “human immunodeficiency viruses types 1 and 2” refers to two distinct species names, “HIV type 1” and “HIV type 2”
- Role names (e.g. athletes, responders)
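The coordination case in the list above can be illustrated with a small rule. This is only a sketch for the single pattern “X types A and B”; the regular expression, the crude plural handling and the function name are assumptions, and real coordinated mentions are far more varied.

```python
import re

# Matches mentions of the form "<head> types <a> and <b>".
COORDINATION = re.compile(r"(?P<head>.+?)\s+types\s+(?P<a>\w+)\s+and\s+(?P<b>\w+)$")


def expand_coordination(mention):
    """Split a coordinated mention into its distinct species names."""
    match = COORDINATION.match(mention.strip())
    if not match:
        return [mention]
    head = match.group("head")
    # Crude singularization of the plural head noun ("viruses" -> "virus").
    if head.endswith("viruses"):
        head = head[:-2]
    return [f"{head} type {match.group('a')}", f"{head} type {match.group('b')}"]


if __name__ == "__main__":
    print(expand_coordination("human immunodeficiency viruses types 1 and 2"))
    print(expand_coordination("Escherichia coli"))
```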



Types of systems
(I) Dictionary-based systems: based on a lexicon of organism names (NCBI Taxonomy), e.g. Kappeler et al (2009) as a base for later PPI extraction; LINNAEUS uses the NCBI dictionary and stop word filtering plus rule-based post-processing (Gerner et al 2010).
(II) Rule-based systems: capture regularities in words, like orthographic features. Linnaean conventions are encoded into rules. Examples: TaxonGrab (rules and lexicon); Find All Taxon names (FAT), rules and n-gram distributions (Sautter et al 2006); FindIT (uBio), additional rules involving affixes and lexicons.
(III) Machine learning based systems: e.g. Wang and Grover 2008 used a Maximum Entropy model to tag species name mentions associated with entities of interest; OrganismTagger (SVM built in GATE).
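A dictionary-based tagger of type (I) with stop word filtering can be sketched as below. The tiny lexicon, the stop word list and the function names are assumptions for illustration; this is not the LINNAEUS implementation, which works from the full NCBI Taxonomy and adds rule-based post-processing.

```python
import re

# Toy lexicon: lower-cased surface form -> scientific name (illustrative subset).
LEXICON = {
    "escherichia coli": "Escherichia coli",
    "e. coli": "Escherichia coli",
    "human": "Homo sapiens",
    "mice": "Mus musculus",
    "spot": "Leiostomus xanthurus",
    "permit": "Trachinotus falcatus",
}

# Common English words that are also organism names and are filtered out.
STOP_WORDS = {"spot", "permit"}


def build_pattern(lexicon):
    """Compile one alternation over all names, longest first, on word boundaries."""
    names = sorted(lexicon, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in names)
    return re.compile(r"(?<!\w)(" + alternation + r")(?!\w)", re.IGNORECASE)


def tag_species(text, lexicon=LEXICON, stop_words=STOP_WORDS):
    """Return (start, end, surface form, scientific name) for each lexicon hit."""
    hits = []
    for match in build_pattern(lexicon).finditer(text):
        surface = match.group(1)
        key = surface.lower()
        if key in stop_words:
            continue  # ambiguous with everyday English, skip
        hits.append((match.start(), match.end(), surface, lexicon[key]))
    return hits


if __name__ == "__main__":
    for hit in tag_species("A permit is required to study E. coli in mice."):
        print(hit)
```

Blanket stop word filtering trades recall for precision: genuine mentions of the fish “spot” are lost, which is one motivation for the context-aware approaches in (II) and (III).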


NCBI Taxonomy



Efforts to catalogue life / species / taxonomies
• Thomson Scientific has a list of over 3 million names
• World Register of Marine Species
• Global Names Index database
• Catalogue of Life / Species 2000
• NEWT (UniProt)
• ZipcodeZoo has over 1.4 million species names
• List of Prokaryotic Names with Standing in Nomenclature
• ITIS Catalogue of Life 2007 Annual Checklist



Ornata overview
• Conditional Random Fields (CRFs), a state-of-the-art graphical model, for organism NER
• Features derived from the words mentioning organisms, their context, semantic tags and part-of-speech tagging
• NER as a sequence labeling task: sentences are the sequences, words are the tokens and entity classes are the labels
• Linear-chain CRFs (SimpleTagger) in the Java-based package MALLET
• Basic feature set: vocabulary from the training set and 14 binary orthographic features, a CONTAINS-GREEK feature to model Greek letters and a STOPWORD feature
• Context features: previous and next words (w_{t-1} and w_{t+1}) of a word w_t (numerals are another distinct feature)
• Other features: POS tags, word class, semantic tags
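The kind of per-token feature extraction described above can be sketched as follows. The slide does not list the 14 orthographic features, so the handful shown here, the Greek and stop word lists, and the feature names other than CONTAINS-GREEK and STOPWORD are assumptions. ORNATA itself uses MALLET (Java); Python is used here only for illustration.

```python
import re

GREEK_NAMES = {"alpha", "beta", "gamma", "delta", "kappa", "lambda"}  # assumed list
STOP_WORDS = {"the", "of", "in", "and", "a", "is"}                    # assumed list


def token_features(tokens, t):
    """Return the set of active binary features for the token at position t."""
    word = tokens[t]
    lower = word.lower()
    features = {
        "WORD=" + lower: True,
        "INIT-CAP": word[:1].isupper(),
        "ALL-CAPS": word.isupper(),
        "HAS-DIGIT": any(ch.isdigit() for ch in word),
        "HAS-HYPHEN": "-" in word,
        "ENDS-IN-PERIOD": word.endswith("."),
        "CONTAINS-GREEK": any(name in lower for name in GREEK_NAMES)
                          or bool(re.search(r"[\u0370-\u03ff]", word)),
        "STOPWORD": lower in STOP_WORDS,
    }

    def context(i):
        # Out-of-sentence positions get a padding symbol; numerals are
        # collapsed into one distinct feature value.
        if i < 0 or i >= len(tokens):
            return "<PAD>"
        return "<NUM>" if tokens[i].isdigit() else tokens[i].lower()

    features["PREV=" + context(t - 1)] = True
    features["NEXT=" + context(t + 1)] = True
    return {name for name, active in features.items() if active}


if __name__ == "__main__":
    sentence = ["Infection", "with", "HIV-1", "in", "TNF-alpha", "treated", "cells"]
    print(sorted(token_features(sentence, 2)))
```

Each token's feature set, paired with its label, forms one line of the sequence input that a linear-chain CRF trainer consumes.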



Corpora used
GENIA corpus: semantically annotated collection of 2,000 MEDLINE abstracts. 6,152 words (1.29% of the total) were part of organism name mentions.

LINNAEUS corpus: corpus specifically annotated for species names, consisting of 100 randomly selected full-text articles from PubMed Central.
- Species name mentions annotated with descriptors like misspelt by author, in incorrect case and misspelt due to OCR errors.
- Normalized to NCBI Taxonomy IDs.
- 19,044 sentences with 504,330 words (26.48 words per sentence).
- 4,092 words (0.81%) were part of organism names in the filtered tags dataset and 5,500 words (1.09%) in the all tags dataset.



Experimental setting and evaluation
10-fold cross validation
Feature sets used:
• Basic feature set
• Basic + Context tags
• Basic + POS tags
• Basic + Word class (long) tags
• Basic + Word class (brief) tags
• Basic + Semantic tags (GENIA dataset alone)
• Basic + Context + POS + Word class (brief) tags
• BIO representation scheme with Basic + Context + POS + Word class (brief) tags
BIO scheme: labels the words as belonging to the Beginning, the Inside or the Outside of organism names.
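The BIO scheme and the 10-fold split mentioned above can be sketched as below. The function names, the token-index span convention (end exclusive) and the round-robin fold assignment are assumptions for illustration; the slides do not say how the folds were drawn.

```python
def to_bio(tokens, entity_spans):
    """Label tokens with B/I/O given (start, end) token spans, end exclusive."""
    labels = ["O"] * len(tokens)
    for start, end in entity_spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


def from_bio(labels):
    """Recover (start, end) spans from a B/I/O label sequence."""
    spans, start = [], None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes a trailing entity
        if start is not None and label != "I":
            spans.append((start, i))
            start = None
        # A stray "I" without a preceding "B" is treated as the start of an entity.
        if label == "B" or (label == "I" and start is None):
            start = i
    return spans


def k_fold_splits(n_items, k=10):
    """Yield (train_indices, test_indices) for k folds, assigned round-robin."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


if __name__ == "__main__":
    tokens = ["Infection", "with", "Hepatitis", "B", "virus", "in", "mice"]
    labels = to_bio(tokens, [(2, 5), (6, 7)])
    print(list(zip(tokens, labels)))
    print(from_bio(labels))
```

Compared with a plain inside/outside labelling, the separate B label lets the model distinguish two adjacent organism names from one long name.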



Analysis and classification of errors (LINNAEUS corpus)
• All classification errors of the best performing models were manually revised
• Words adjacent to correct classifications: examples are “Plasmids” in “Plasmids, yeast”, “Guiyu” in “Guiyu patient”, “mitochondria” in “wheat mitochondria” and “electroporate” in “electroporate potato”
• Ambiguous words: “travelers who fly”, “Child”, and “V. B. Vouk” in an author name list. “Rice”, a false positive of the LINNAEUS system, was handled correctly by ORNATA
• Detected missing annotations in the LINNAEUS corpus: “Bortrytis blight”, “adenovirus”, “murine”, “women”
• The tagger missed many long organism names like “Pileated Woodpecker Dryocopus pileatus”, “Ivory-billed Woodpecker Campehilus principalis” and “Viral Hemorrhagic Septicemia Virus (VHSV)”
• It also missed some names in the scientific form like “L.whitmani” and “C.elegans”



Analysis and classification of errors (GENIA corpus)
• False positives: words near organism names which shared the same orthographic features & context. Examples: “HIV pathogenesis”, “EBV lytic origin”, “HTLV-1 tat”
• False negatives: words whose orthographic features would not have provided any clue (like “sooty mangabeys”, “housedust mites”, “simian virus 40” and “primate lentiviruses”) or would have acted against them (“iHIV-1” as an abbreviation of “heat-inactivated HIV-1”)
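False positives and false negatives like those above feed directly into the precision, recall and F1 figures. A minimal sketch of that computation follows; exact span matching and the (sentence id, start, end) representation are assumptions, since the slides do not state how matches were scored.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 over sets of exact-match annotations."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # correctly tagged mentions
    fp = len(predicted - gold)   # tagged but not annotated (false positives)
    fn = len(gold - predicted)   # annotated but missed (false negatives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Annotations as (sentence id, start token, end token); toy data.
    gold = {(0, 2, 5), (0, 6, 7), (1, 0, 1), (1, 3, 4)}
    predicted = {(0, 2, 5), (1, 0, 1), (1, 4, 5)}
    p, r, f1 = precision_recall_f1(gold, predicted)
    print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```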



Generating text annotations

http://myminer.armi.monash.edu.au


MyMiner for manual text classification


MyMiner for entity tagging


CHEMDNER

CENTRO NACIONAL DE INVESTIGACIONES ONCOLOGICAS

Florian Leitner, Miguel Vazquez and Alfonso Valencia (CNIO, Spain)
Julen Oyarzabal and Obdulia Rabal (Small Molecule Discovery Platform, Center for Applied Medical Research, University of Navarra, Spain)
David Salgado (Medical Genomics and Functional Genetics Institute at the Aix-Marseille University in Marseille, France)
Ashish V Tendulkar & Kamalakannan A. R. (Department of Computer Science and Engineering, Indian Institute of Technology, Madras, India)