Trends in biomedical text mining
Martin Krallinger, Spanish National Cancer Research Centre, (CNIO), Madrid, Spain
November 18th, 2013 Jornadas MAVIR
• Background, introduction, biomedical annotations
• BioCreative: Critical Assessment of Information Extraction in Biology
• Recognition of chemical compound and drug mentions
• Text mining applications and the extraction of drug-induced adverse effects
• Text mining of pathways and enzymatic reactions
• Text mining and ranking of mutations, proteins for cancer
• Detection of organisms and species in text
• MyMiner tool for building biomedical text annotations
Unstructured Text (implicit knowledge)
Information Retrieval
Knowledge Discovery
Biomedical Databases
Structured database records, ontologies and controlled vocabularies
Structured content (explicit knowledge)
Advanced Information Retrieval
Named Entity Recognition Information extraction
Semantic metadata
Adapted from: Text Mining for Biomedicine: Techniques & tools, Sophia Ananiadou, Chikashi Nobata, Yutaka Sasaki, Yoshimasa Tsuruoka
Characteristics of biomedical language
• Heavy use of domain-specific terminology (12% biochemistry-related technical terms*), examples: chemoattractant, fibroblasts, angiogenesis
• Polysemic words (term homonymy), examples: APC: (1) Argon Plasma Coagulation, (2) Activated Protein C; or teashirt: (1) a type of cloth, (2) a gene name (tsh)
• Heavy use of acronyms, examples: activated protein C (APC), or vascular endothelial growth factor (VEGF)
• Most words with low frequency (data sparseness)
* Netzel R, Perez-Iratxeta C, Bork P, Andrade MA. The way we write. EMBO Rep. 2003 May;4(5):446-51
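The acronym pattern on this slide, a long form followed by its short form in parentheses, can be detected with a simple initial-letter heuristic. The following is a minimal sketch, not a method from the talk: the function name `find_acronyms` and the matching rule (each acronym letter equals the first letter of one preceding word) are assumptions chosen for illustration, and real acronyms often break this rule.

```python
import re


def find_acronyms(text):
    """Return (long_form, acronym) pairs for patterns like 'long form (ACR)'.

    Hypothetical heuristic: the acronym letters must equal the initials of
    the words that directly precede the parenthesis.
    """
    pairs = []
    # Candidate acronym: 2 to 10 uppercase letters or digits in parentheses.
    for match in re.finditer(r"\(([A-Z][A-Z0-9]{1,9})\)", text):
        acronym = match.group(1)
        words = re.findall(r"[A-Za-z0-9]+", text[:match.start()])
        candidate = words[-len(acronym):]
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            pairs.append((" ".join(candidate), acronym))
    return pairs


example = ("Activated protein C (APC) , or vascular endothelial "
           "growth factor (VEGF)")
pairs = find_acronyms(example)
```

A heuristic this strict misses acronyms that skip words or use inner letters, which is one reason acronym extraction is treated as a task of its own.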
Ambiguity and variability
• New names & terms are created (ad hoc) and they change, example: ‘This disorder maps to chromosome 7q11-21, and this locus was named CLAM.’ [PMID:12771259]
• Lexical & typographical variants (e.g. in writing gene names), example: TNF-alpha and TNF alpha (without hyphen)
• Term synonymy: different words with the same meaning
• Different writing styles (native languages): syntactic, semantic and word usage implications
• Heavy use of referring expressions (anaphora, cataphora and ellipsis) and inference, example: Glycogenin is a glycosyltransferase. It functions as the autocatalytic initiator for the synthesis of glycogen in eukaryotic organisms.
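Typographical variants such as "TNF-alpha" and "TNF alpha" are often collapsed to a shared key before matching. The sketch below is an illustration under stated assumptions: the function `normalize_term`, the small Greek letter table and the choice of separators are hypothetical, not rules taken from the talk.

```python
import re

# Hypothetical, deliberately tiny mapping of Greek letters to spelled-out names.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma"}


def normalize_term(term):
    """Map typographical variants of a term to one comparison key."""
    for letter, name in GREEK.items():
        term = term.replace(letter, name)
    term = term.lower()
    # Treat hyphens, underscores, slashes and whitespace runs as one space.
    term = re.sub(r"[-_/\s]+", " ", term)
    return term.strip()


variants = ["TNF-alpha", "TNF alpha", "TNF-α", "tnf  alpha"]
keys = {normalize_term(v) for v in variants}
```

All four variants collapse to the same key here; term synonymy (different words with the same meaning) is not addressed by this kind of string normalization and needs a lexicon.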
Biomedical text mining: historical view
Biomedical text mining applications
• Acronym & term extraction
• Interactions
• Function
• Disease
• Groups, lists
• Entity recognition
• Sequence
• Retrieval, classification
Krallinger M, Valencia A, Hirschman L. Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol. 2008;9 Suppl 2:S8.
Krallinger M, Leitner F, Valencia A. Analysis of biological processes and diseases using text mining approaches. Bioinformatics Methods in Clinical Research. Methods in Molecular Biology 593, To appear.
Importance of scientific literature data
• Life sciences generate heterogeneous data types (articles, sequence, structure, ...)
• Natural language is used for communicating scientific discoveries
• Natural language texts are amenable to direct human interpretation
• Natural language appears not only in scientific articles, but also in patents, reports, newswire, database records, controlled vocabularies (GO terms), ...
• Functional information & annotations are directly or indirectly derived from the literature (curation and electronic annotation)
• Databases are generally only capable of covering a small fraction of the biological context information that can be encountered in the literature
• Contextual information of experimental results (cell line, tissue, conditions)
• User demands for better information access (beyond keyword searches)
• Rapid growth of information: manual information extraction is not efficient
Biomedical literature & scientific discovery
Biology:
• Define the biological question
• Select the actual target being studied
• Extract information relevant for experimental set up
• Locate relevant resources
• Essential to understand and interpret the resulting data
• Draw conclusions about new discoveries
• Communicated to the scientific community using publications in peer-reviewed journals
Clinics:
• Resource for clinical decision support in evidence-based clinical practice
• Useful information for diagnostic aids
Pharma:
• Drug discovery and target selection
• Identifying adverse drug effects
• Competitive intelligence and knowledge management
Funding:
• Global view of the current research state & monitor trends to ensure optimal resource allocation
Publ.:
• Find domain experts for specific topics for the peer-review process & detect potential cases of plagiarism
Bio-databases annotations
Textual data in biological annotations
Controlled vocabularies and ontologies
Biocuration: manual literature annotations
Bio-entities
Scientific Literature
Controlled vocabularies
Database curator Annotation Databases
Ontologies: growing in content and terms
Increasing number of ontologies: > 130
Formats (OBO, OWL, XML, RDF) (http://www.obofoundry.org)
OLS: browsing ontologies
www.ebi.ac.uk/ontology-lookup
Biocuration: main tasks
Databases using/exploring text-mining help
Text mining for annotation databases: where and when
Krallinger, Martin. A Framework for BioCuration Workflows (part II). Available from Nature Precedings (2009)
Biocuration workflows
Provided by Andrew Winter BioGRID database (http://wiki.thebiogrid.org/doku.php/curation_description)
• Compare different methods and strategies
• Reproduce performance of systems on common data
• Provide useful data collections: Gold Standard data
• Explore meaningful evaluation strategies and tools
• Determine the state of the art
• Monitor improvements in the field
• Point out needs of the user community
• Promote collaborative efforts
Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A. Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol. 2008;9 Suppl 2:S1.
[Diagram: community challenges arranged by area (BIOINFORMATICS, NLP/IR/IE, BIO-NLP related)]
Challenges shown: CASP, CAMDA, CAPRI, GASP, GAW, PTC, TREC, TREC Genomics track, MUC, ACE, SEMEVAL, SENSEVAL, RTE, KDD cup 2002, BIOCREATIVE, BioNLP Shared Task, DDI challenge, LLL05 challenge, i2b2 (medical), JNLPBA shared task
KDD: Knowledge Discovery and Data mining
CASP: Critical Assessment of Protein Structure Prediction
CAMDA: Critical Assessment of Microarray Data Analysis
CAPRI: Critical Assessment of Prediction of Interactions
GASP: Genome Annotation Assessment Project
GAW: Genome Access Workshop
PTC: Predictive Toxicology Challenge
JNLPBA: Joint workshop on Natural Language Processing in Biomedicine
TREC: Text Retrieval conference
MUC: Message Understanding conference
LLL05: Genic interaction extraction challenge
RTE: Textual Entailment challenge
Adapted from Krallinger et al. Genome Biology 2008 9(Suppl 2):S1
Biomedical text mining: historical view
Critical Assessment of Information Extraction systems in Biology
• Community-wide effort, several organizers, different tasks
• Evaluation of text mining & information extraction applied to the biological and biomedical domain
• Increasing number of groups working in the area of text mining
• Need for common standards and shared evaluation criteria to enable comparison: avoid "one system = one evaluation data set"
• Promote development of real applications, tools
• Assessment of scientific progress: monitor improvements
• Involve domain experts (end users), biological database curators and text mining experts
Krallinger M, Morgan A, Smith L, Leitner F, Tanabe L, Wilbur J, Hirschman L, Valencia A. Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol. 2008;9 Suppl 2:S1.
BioCreative I: Gene mention recognition, Gene normalization, GO annotation
BioCreative II: Gene mention recognition, Gene normalization, Protein interaction
BioCreative II.5: Protein interaction
BioCreative III: Protein interaction, Gene normalization, User interaction
Evaluation metrics (I)
• FN: false negatives, incorrect negative classification results (type II errors)
• FP: false positives, incorrect positive classification results (type I errors)
• TN: true negatives, correct negative classification results (correct rejection)
• TP: true positives, correct positive classification results (correct hit)
AUC iP/R
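The four outcome types listed above can be counted directly from paired gold-standard and predicted labels. This is a minimal sketch for binary labels; the function name `confusion_counts` and the toy label lists are illustrative assumptions.

```python
def confusion_counts(gold, predicted):
    """Count (TP, FP, TN, FN) for two equally long sequences of binary labels."""
    tp = fp = tn = fn = 0
    for gold_label, predicted_label in zip(gold, predicted):
        if predicted_label and gold_label:
            tp += 1  # correct hit
        elif predicted_label and not gold_label:
            fp += 1  # type I error
        elif not predicted_label and gold_label:
            fn += 1  # type II error
        else:
            tn += 1  # correct rejection
    return tp, fp, tn, fn


gold = [1, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1]
counts = confusion_counts(gold, predicted)
```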
Evaluation metrics
• Accuracy (a) is the percentage of correctly labeled results over all results
• Matthews correlation coefficient (MCC) is a balanced measure of a test's results
• Area under the ROC curve (AUC ROC) can evaluate ranked (ordered) results
• Other metrics: precision/recall curves, average precision (AP), threshold average precision (TAP) and agreement measures
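Accuracy and MCC both follow from the TP, FP, TN and FN counts. The sketch below uses the standard textbook definitions; the function names and the convention of returning 0.0 when the MCC denominator is zero are assumptions made for this illustration.

```python
import math


def accuracy(tp, fp, tn, fn):
    """Fraction of correctly labeled results over all results."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient, in the range -1 to 1."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denominator


acc_value = accuracy(2, 1, 1, 1)
mcc_value = mcc(2, 1, 1, 1)
```

On an unbalanced set, labeling everything as negative can still give a high accuracy, while MCC stays at 0, which is why MCC is described as the balanced measure.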
• Coordinated by the National Center for Biotechnology Information (NCBI)
• Inspired by the BioCreative IA task: Detection of Gene mentions
• Highest f-score*: 87.21
• 21 participating teams
• Many systems used:
  - Conditional Random Fields (CRF); Support Vector Machines (SVM)
  - POS tagging, stemming
• Exploited systems include: Mallet and the GENIA tagger
• Additional resources: Medline, HUGO and Medpost
*f-score: harmonic mean of precision and recall
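The footnote defines the f-score as the harmonic mean of precision and recall. A minimal sketch of that definition follows; the optional `beta` weight is the usual generalization and is an addition not mentioned on the slide.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    beta_squared = beta * beta
    return ((1 + beta_squared) * precision * recall
            / (beta_squared * precision + recall))


balanced = f_score(0.8, 0.8)
skewed = f_score(0.5, 1.0)
```

Because the harmonic mean is pulled toward the lower value, a system cannot reach a high f-score by maximizing only precision or only recall.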
• Coordinated by MITRE (Lynette Hirschman), similar to the BioCreative 1B task
• A total of 20 groups submitted predictions (up to three runs per team)
• Normalization: association of an entity to database records (sequence)
• Extract a list of gene identifiers of genes mentioned in PubMed abstracts
• Highest f-score around 0.8
• Normalization reference database: human genes/proteins to EntrezGene
• Dictionary look-up
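Dictionary look-up, the last item above, can be sketched as longest-match lookup of token sequences in a lexicon that maps names to gene identifiers. Everything here is an assumption for illustration: the toy lexicon, the function names and the tokenization are hypothetical, and participating systems used far larger dictionaries plus disambiguation steps.

```python
import re

# Hypothetical toy lexicon: normalized name -> gene identifier (illustrative only).
TOY_LEXICON = {
    "tnf": "7124",
    "tnf alpha": "7124",
    "tumor necrosis factor": "7124",
    "tp53": "7157",
    "p53": "7157",
}


def _normalize(name):
    """Lowercase and treat hyphens and whitespace runs as a single space."""
    return re.sub(r"[-\s]+", " ", name.lower()).strip()


def normalize_genes(text, lexicon=TOY_LEXICON, max_tokens=3):
    """Return the sorted list of unique identifiers for names found in text."""
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    found = set()
    for i in range(len(tokens)):
        # Try the longest candidate first, then shorter ones.
        for n in range(max_tokens, 0, -1):
            candidate = _normalize(" ".join(tokens[i:i + n]))
            if candidate in lexicon:
                found.add(lexicon[candidate])
                break
    return sorted(found)


ids = normalize_genes("TNF-alpha induces p53 expression.")
```

The output is a list of identifiers per abstract rather than mention offsets, which matches the task description of extracting a list of gene identifiers.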
**Krallinger M, Leitner F, Vazquez M, Salgado D, Marcelle C, Tyers M, Valencia A, Chatr-aryamontri A. How to link ontologies and protein-protein interactions to literature: text-mining approaches and the BioCreative experience. Database (Oxford). 2012 Mar 21;2012:bas017
Protein-Protein Interaction (PPI)
• Specific physical contacts with molecular binding between proteins, both transient as well as stable contacts
• PPI information: literature, large-scale experiments, bioinformatics predictions
• Public repositories integrate information from large- and small-scale PPI experiments reported in the scientific literature
• Pathguide contains information about 325 biological pathway related resources and molecular interaction related resources (pathguide.org)
• Annotation effort shared by various interaction databases: BioGRID, MINT, BIND, CORUM, DIP, HAPPI, HPRD, I2D, InnateDB, IntAct, InteroPorc, iRefIndex, iRefWeb, MatrixDB, MIPS, PC, PIMRider
• Common vocabulary and standards to improve consistency and efficiency of PPI annotations: PSI-MI
IMS
ACT: Article categorization task
• Binary classification of recent PubMed abstracts as PPI relevant
• Predictions provided together with a confidence score in the ]0..1] range
• Evaluation based on AUC iP/R (also additional analysis, f-score, accuracy)
• NOT a balanced set: abstracts, journals of biocuration interest
• Exhaustive manual revision by three domain experts and refinement based on database curators of BioGRID and MINT
• IAA pairwise percentage agreement between MINT & BioGRID: 95%
• Article ID ➠ Class ➠ [Rank ➠] Confidence
TRAINING SET (Balanced): total size 2280; + PPI: 1140; Not PPI: 1140; proportion: 50%
DEVELOPMENT SET (Unbalanced): total size 4000; + PPI: 682; Not PPI: 3318; proportion: 17.05%
TEST SET (Unbalanced): total size 6000; + PPI: 910; Not PPI: 5090; proportion: 15.17%
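The ACT evaluation ranks articles by confidence and scores the ranking with AUC iP/R. The sketch below follows one common definition, stated here as an assumption: the interpolated precision at a recall level is the highest precision reached at that recall or any higher recall, and the area sums it over the recall steps. The function name `auc_ipr` is hypothetical and this is not the official evaluation script.

```python
def auc_ipr(ranked_labels):
    """Area under the interpolated precision/recall curve.

    ranked_labels: gold labels (True = PPI relevant) of the articles,
    sorted by descending prediction confidence.
    """
    total_positives = sum(1 for label in ranked_labels if label)
    if total_positives == 0:
        return 0.0
    # Precision measured at each rank where a true positive is retrieved.
    precisions = []
    true_positives = 0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            true_positives += 1
            precisions.append(true_positives / rank)
    # Interpolate from the highest recall downwards; each true positive
    # contributes one recall step of width 1 / total_positives.
    area = 0.0
    best = 0.0
    for precision in reversed(precisions):
        best = max(best, precision)
        area += best / total_positives
    return area


score = auc_ipr([True, False, True, False])
```

A ranking-based score fits the unbalanced development and test sets, where only about 15 to 17% of the abstracts are PPI relevant.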
Example system (specialized): PIE
http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/PIE/
IMS OVERVIEW Articles
PSI-MI 2.5 Ontology
PMID:16317001
MI:0017
BioCreative II and III: Interaction Method results
[Figure: bar chart per interaction detection method for the Test, Development and Train sets; axis from 0 to 0.2]
Interaction detection methods shown: phosphatase assay; light scattering; filter binding; cosedimentation in solution; protease assay; peptide array; fluorescence technology; enzymatic study; cross-linking study; confocal microscopy; enzyme linked immunosorbent assay; far western blotting; cosedimentation through density gradient; affinity chromatography technology; isothermal titration calorimetry; competition binding; nuclear magnetic resonance; tandem affinity purification; fluorescent resonance energy transfer; molecular sieving; x-ray crystallography; protein kinase assay; surface plasmon resonance; two hybrid; anti bait coimmunoprecipitation; fluorescence microscopy; anti tag coimmunoprecipitation; pull down
Krallinger M, Leitner F, Rodriguez-Penagos C, Valencia A. Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biol. 2008;9 Suppl 2:S4.
‘Difficult’ cases
['19481529', 'MI:0424', '1', '0.630389', 'phosphorylated Ser437Ala mutant , suggesting phosphorylation of PACS-2 Ser437 was required for binding 14-3-3 proteins . We then conducted a fluorescence polarization assay to determine quantitatively whether phosphorylated'] ['protein kinase assay']
['18922473', 'MI:0006', '2', '0.472072315860236', 'Interaction between the endogenous TRAF6 and TAK1 in AML12 cells as determined by immunoprecipitation with anti - TAK1 antibody , followed by anti - TRAF6 Western blot . The TGF - β treatment was for 30 minutes and the total rabbit IgG'] ['anti bait coimmunoprecipitation', 'anti bait coip']
BioCreative IV: CHEMDNER task
http://www.biocreative.org/tasks/biocreative-iv/chemdner/
Krallinger et al. Analysis of biological processes and diseases using text mining approaches. Methods in Molecular Biology 593, 341-382
• PubChem
• ChEBI
• CHEMBL
• Pathway DBs
• Molecular interaction DBs
• Structure databases
• Toxicology/pharmacogenomics DBs
• Biochemical/metabolic pathways DBs
• Many others, ...
• Aim: Identify entity mentions in text
• Generic NER: corporate names and places (0.9 F-score), see Message Understanding Conferences (MUC)
• Biology/chemistry NER: more complex (synonyms, disambiguation, typographical variants, official symbols not used, ...)
• Methods: POS tagging, rule-based, flexible matching, statistics, ML (naïve Bayes, ME, SVM, CRF, HMM)
• Important for down-stream text mining: essential step for finding bioentity relevant documents & for automatic extraction of relationships using information extraction (IE)
• Genes and proteins
• DNA
• RNA
• Cell lines
• Cell types
• Chemical compounds
• Mutations, sequence variations, sequences
• Species and organisms
• Anatomical terms
• Disease terms, ...
• Authors often do not follow official IUPAC nomenclature guidelines
• Chemical compounds/drugs often have many synonyms or aliases (e.g. systematic names, trivial names and abbreviations referring to the same entity)
• Existence of hybrid mentions (e.g. mentions that are partially systematic and partially trivial: semi-systematic)
• Chemical compounds are ambiguous with respect to other entities like gene/protein names or common English words
• Alternative typographical variants: hyphens, brackets, and spacing
• Alternative word order variants
• The ambiguity of chemical acronyms, short formulae and trivial names
• Identifying new chemical compound names (novel entities)
• A set of specialized word tokenization rules is required for chemical terms
Main strategy types
[CDI] chemical document indexing sub-task: given a set of documents, return for each of them a ranked list of the unique chemical entities described within it.
CDI prediction fields: 1) PMID 2) Mention (unique) 3) Rank 4) Confidence score
[CEM] chemical entity mention recognition sub-task: for a given document, provide the start and end character indices of all the chemical entities mentioned in it.
CEM prediction fields: 1) PMID 2) Mention offset (T: title / A: abstract, start offset, end offset) 3) Rank 4) Confidence score
Up to five different automatic annotations (runs) for each sub-task
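A CEM prediction carries four fields: PMID, mention offset, rank and confidence score. The sketch below parses one such record. The tab-separated layout, the `T:28:41` offset notation (borrowed from the examples later in the deck), the zero-based start and exclusive end, and all names in the code are assumptions for illustration; the task documentation remains the authoritative format description.

```python
from typing import NamedTuple


class CemPrediction(NamedTuple):
    pmid: str
    section: str       # "T" for title, "A" for abstract
    start: int         # assumed zero-based start offset
    end: int           # assumed exclusive end offset
    rank: int
    confidence: float


def parse_cem_line(line):
    """Parse one assumed tab-separated CEM prediction line."""
    pmid, offset, rank, confidence = line.rstrip("\n").split("\t")
    section, start, end = offset.split(":")
    if section not in ("T", "A"):
        raise ValueError(f"unknown section marker: {section!r}")
    return CemPrediction(pmid, section, int(start), int(end),
                         int(rank), float(confidence))


def mention_text(prediction, title, abstract):
    """Recover the mention string from the title or abstract text."""
    source = title if prediction.section == "T" else abstract
    return source[prediction.start:prediction.end]


prediction = parse_cem_line("23580394\tT:28:41\t1\t0.9")
toy_title = "x" * 28 + "Squaraine Dye" + " in a toy title"
```

Collecting the unique `mention_text` strings per PMID would give the kind of list the CDI sub-task asks for, although ranking and confidence for CDI would still need their own logic.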
Main metric: Micro-averaged F-score
• Automated predictions against manual annotations (Gold Standard)
• Exact match evaluation
• FN: false negatives, incorrect negative classification results (type II errors)
• FP: false positives, incorrect positive classification results (type I errors)
• TN: true negatives, correct negative classification results (correct rejection)
• TP: true positives, correct positive classification results (correct hit)
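Micro-averaging pools the TP, FP and FN counts over all documents before computing precision, recall and F-score, and exact match means a prediction counts only if its offsets equal a gold annotation. A minimal sketch under those two readings follows; the function name `micro_prf` and the data layout (document id mapped to a set of annotation tuples) are assumptions.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 with exact matching.

    gold, predicted: dict mapping a document id to a set of annotations,
    for example (section, start, end) tuples.
    """
    tp = fp = fn = 0
    for doc_id in set(gold) | set(predicted):
        gold_set = gold.get(doc_id, set())
        predicted_set = predicted.get(doc_id, set())
        tp += len(gold_set & predicted_set)
        fp += len(predicted_set - gold_set)
        fn += len(gold_set - predicted_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = {"1": {("A", 0, 5), ("A", 10, 15)}, "2": {("T", 0, 3)}}
predicted = {"1": {("A", 0, 5), ("A", 10, 14)}, "2": set()}
precision, recall, f1 = micro_prf(gold, predicted)
```

In the toy data the prediction ("A", 10, 14) overlaps a gold mention but misses its end offset, so under exact matching it counts as one false positive plus one false negative.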
(1) Corpus selection and sampling
(2) Annotation guidelines and their corpus-driven refinements
(3) Entity annotation granularity
(4) Human annotator expertise and training
(5) Annotation tools and interface
(6) Annotation consistency, definition of upper & lower performance boundaries to be expected by tools
(7) Corpus format and availability
SYSTEMATIC: Systematic names of chemical mentions, e.g. IUPAC and IUPAC-like names. Examples: 2-Acetoxybenzoic acid; N-(4-hydroxyphenyl)acetamide; 3,5,4'-trihydroxy-trans-stilbene
IDENTIFIERS: Database ids of chemicals: CAS numbers, Company Registry numbers, PubChem, ChEBI, CHEMBL identifiers. Examples: 501-36-0; 445154; CHEMBL 504
FORMULA: Molecular formula, SMILES, InChI, InChIKey. Examples: CC(=O)Oc1ccccc1C(=O)O; InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12); C9H8O4
TRIVIAL: Trivial, trade (brand), common/generic names of compounds (also International Nonproprietary Name, British Approved Name & United States Adopted Name). Examples: Aspirin; Acylpyrin; paracetamol; acetaminophen; Tylenol
ABBREVIATION: Abbreviations & acronyms of chemical compounds. Examples: DMSO; GABA
FAMILY: Chemical families that can be linked to chemical structure: plurals of systematic IUPAC names, general formulas, etc. Examples: Iodopyridazines; Diphenols; quinolines; terpenoids; ROH
MULTIPLE: Chemicals that are not a continuous string of characters (e.g. multiple chemicals disrupted by coordinated clauses). Example: thieno[2,3-d] and thieno[3,2-d] fused oxazin-4-ones
• Define what constitutes a chemical entity mention (CEM)
  - Mentions of practical relevance based on potential target applications
  - Mentions that can be linked to chemical structure information
• Review public/available annotation guidelines for chemicals: annotation manual by Corbett et al as initial reference
• CHEMDNER corpus main modifications:
  - Only chemical nouns (and specific adjectives, treated as nouns) are tagged (not reactions, prefixes or enzymes)
  - Reduction in the number of rules
  - Rules were grouped as Positive, Negative, Orthography, MultiWords
  - Multiword rules were very simplified: maybe less precise annotation but less error-prone to human annotation
  - Additional rules for assignment to CEM classes (similar to Klinger et al)
• Iterative process → Guidelines were slightly refined after first sample test annotation
• Required refinements detected during training/test set annotations will be incorporated in the future release of the CHEMDNER Corpus plus
- User-friendly web-based curation tool
- List of available CEM types
- Auto-completion tool
- Possibility to add comments
- Some post-processing steps done with MyMiner and other scripts
[Figure: bar chart by CEM class (ABBREVIATION, FAMILY, FORMULA, IDENTIFIER, MULTIPLE, SYSTEMATIC, TRIVIAL, NO CLASS) for the TRAIN, DEVELOPMENT and TEST sets; axis from 0 to 10000]
• Estimate of the lower boundary of the expected recall
• Vocabulary transfer, defined as the proportion of entities (without repetition) that appear both in the training/development set as well as in the test corpus: 36.34% (train & dev), 27.77% (train), 27.70% (dev)
• Dictionary baseline using train/dev vocabulary: precision 57.22%, recall 57.00%, F-score 57.11%
• Existing tools (out of the box) F-scores: ChemicalTagger (27.46), OSCAR4 (30.54), Jochem (56.40), MeSH term annotations (11.01)
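Vocabulary transfer compares the sets of unique entity strings in two corpora. The sketch below is one possible reading: it divides the shared vocabulary by the test vocabulary, which is an assumption because the slide does not state the denominator. The function name and toy mentions are hypothetical. The last lines only re-derive the reported baseline F-score from the reported precision and recall.

```python
def vocabulary_transfer(train_mentions, test_mentions):
    """Share of unique test entities that also occur in the training data.

    Assumption: the denominator is the number of unique test entities.
    """
    train_vocabulary = set(train_mentions)
    test_vocabulary = set(test_mentions)
    if not test_vocabulary:
        return 0.0
    return len(train_vocabulary & test_vocabulary) / len(test_vocabulary)


transfer = vocabulary_transfer(
    ["aspirin", "DMSO", "aspirin"],
    ["aspirin", "GABA", "paracetamol", "DMSO"],
)

# Consistency check of the reported dictionary baseline (values in percent).
baseline_precision, baseline_recall = 57.22, 57.00
baseline_f = round(2 * baseline_precision * baseline_recall
                   / (baseline_precision + baseline_recall), 2)
```

A low vocabulary transfer means a pure look-up of training mentions cannot reach high recall, which is why the dictionary baseline serves as a lower boundary.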
• The inter-annotator agreement (IAA) study constitutes a sort of upper boundary for the expected systems performance
• Agreement between humans manually labeling the data
• Useful for estimating the difficulty of the task and the quality of annotations
• Inter-annotator agreement based on double annotation of 100 abstracts randomly chosen from the entire dataset
• 91% for matching of the manual annotations (regardless of CEM class) and 85.26% for exact matching and same CEM class
• Most discrepancies: annotations that were missed by either one of them (FN cases)
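The two agreement figures differ only in whether the CEM class must match as well as the mention span. The sketch below illustrates that distinction. The agreement formula used here (shared annotations divided by the union of both annotators' annotations) is an assumption, since the slide does not say how the percentage was computed; the function name and toy annotations are hypothetical.

```python
def percentage_agreement(annotator_a, annotator_b, use_class=True):
    """Agreement between two annotators over the same documents.

    Each annotation is a (doc_id, start, end, cem_class) tuple.
    Assumed formula: |A intersect B| / |A union B|.
    """
    def key(annotation):
        return annotation if use_class else annotation[:3]

    set_a = {key(annotation) for annotation in annotator_a}
    set_b = {key(annotation) for annotation in annotator_b}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


first = [("d1", 0, 7, "TRIVIAL"), ("d1", 10, 14, "ABBREVIATION")]
second = [("d1", 0, 7, "TRIVIAL"), ("d1", 10, 14, "FORMULA"),
          ("d1", 20, 25, "FAMILY")]
span_only = percentage_agreement(first, second, use_class=False)
span_and_class = percentage_agreement(first, second, use_class=True)
```

In the toy data the third annotation exists for only one annotator, the kind of missed mention the slide names as the main source of discrepancies.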
• Test set annotation using an additional curation team to recover potential FN cases
• Main annotator team carried out the labeling of the entire CHEMDNER corpus; a secondary team of curators additionally annotated only the test set
• Relied primarily on annotations of the main annotator team: higher degree of training & provided active feedback for refinement of annotation guidelines
• Conflicting annotations between the two teams were presented to the main curation group for manual revision
• 1,185 annotations added to the original 24,671 test set annotations (4.08%), 505 (2.05%) removed: final test set of 25,351 annotations
Team id | Team leader | Members | Type | Institution
173 | Zhiyong Lu | 3 | Academic | NCBI/NLM/NIH
177 | Tolga Can | 2 | Academic | Middle East Technical University, Ankara, Turkey
179 | Daniel Lowe | 2 | Commercial | NextMove Software
182 | Alexander Klenner | 2 | Academic | Fraunhofer-Institute for Algorithms & Sci.Comp.
184 | Rafal Rak | 2 | Academic | National Centre for Text Mining, University of Manchester
185 | S V Ramanan | 2 | Commercial | RelAgent Pvt Ltd
191 | Ana Usié Chimenos | 5 | Academic | Universitat de Lleida
192 | Hua Xu | 5 | Academic | The University of Texas Health Science Center at Houston (UTHealth)
196 | Francisco Couto | 3 | Academic | LASIGE, University of Lisbon
197 | Sérgio Matos | 3 | Academic | University of Aveiro
198 | Philippe Thomas | 5 | Academic | Humboldt-Universität zu Berlin
199 | Matthias Irmer | 6 | Commercial | OntoChem
207 | Karin Verspoor | 3 | Academic | NICTA (National ICT Australia)
214 | Daniel Bonniot de Ruisselet | 1 | Commercial | ChemAxon
217 | Li LiShuang | 6 | Academic | DaLian University of Technology
219 | Madian Khabsa | 1 | Academic | The Pennsylvania State University
222 | Saber Ahmad Akhondi | 6 | Academic | Erasmus MC, Rotterdam, The Netherlands
225 | Daniel Sanchez-Cisneros | 4 | Academic | Universidad Carlos III & Univ. Autónoma Madrid
231 | Donghong Ji | 4 | Academic | Wuhan University
233 | Tsendsuren Munkhdalai | 1 | Academic | Chungbuk National University, South Korea
238 | Hongfang Liu | 5 | Academic | Mayo Clinic
245 | Slavko Zitnik | 1 | Academic | University of Ljubljana
259 | Shuo Xu | 4 | Academic | Institute of Scientific & Technical Information of China
262 | Asif Ekbal | 3 | Academic | IIT Patna, India
263 | Masaharu Yoshioka | 2 | Academic | Hokkaido University, Sapporo, Japan
265 | shu ching-yao | 1 | Academic | Yuan Ze University
267 | LiLiShuang | 5 | Academic | DaLian University of Technology
• 23 teams (91 runs)
• Highest F-score: 88.20
• Highest precision: 98.66 (recall of 16.65)
• Highest recall: 92.24 (precision of 76.29)
[Figure: CHEMDNER, F-score per CDI run (axis from 0 to 1). Runs shown: team_231_cdi_run3, team_184_cdi_run4, team_198_cdi_run4, team_231_cdi_run2, team_198_cdi_run5, team_173_cdi_run4, team_231_cdi_run5, team_173_cdi_run3, team_179_cdi_run2, team_198_cdi_run3, team_233_cdi_run5, team_233_cdi_run1, team_179_cdi_run3, team_197_cdi_run5, team_173_cdi_run5, team_231_cdi_run4, team_197_cdi_run4, team_185_cdi_run2, team_233_cdi_run2, team_245_cdi_run3, team_199_cdi_run1, team_214_cdi_run2, team_222_cdi_run1, Team_207_cdi_run2, team_214_cdi_run4, team_217_cdi_run1, team_217_cdi_run3, team_214_cdi_run1, team_219_cdi_run1, team_265_cdi_run1, team_238_cdi_run3, team_219_cdi_run3, team_219_cdi_run5, Team_207_cdi_run1, team_177_cdi_run2, team_191_cdi_run1, team_238_cdi_run4, team_182_cdi_run1, team_225_cdi_run1, team_225_cdi_run4, team_182_cdi_run3, team_225_cdi_run5, team_219_cdi_run2, team_196_cdi_run4, team_182_cdi_run2, team_196_cdi_run1]
• P: precision; R: recall; F1: F-score; SDs: standard deviation in the bootstrap samples
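The legend mentions a standard deviation over bootstrap samples. The sketch below shows one way such a value can be estimated: resample documents with replacement, recompute the micro-averaged F1 on each sample, and take the standard deviation of the resulting scores. The resampling unit, the number of samples, the fixed seed and the function name are all assumptions, not details taken from the evaluation.

```python
import random
import statistics


def bootstrap_f1_sd(per_document_counts, n_samples=1000, seed=0):
    """Mean and standard deviation of micro F1 over bootstrap samples.

    per_document_counts: list of (tp, fp, fn) tuples, one per document.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_samples):
        # Resample documents with replacement, keeping the sample size.
        sample = rng.choices(per_document_counts, k=len(per_document_counts))
        tp = sum(counts[0] for counts in sample)
        fp = sum(counts[1] for counts in sample)
        fn = sum(counts[2] for counts in sample)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return statistics.mean(scores), statistics.stdev(scores)


toy_counts = [(3, 1, 0), (2, 0, 2), (0, 1, 1), (5, 0, 0)]
mean_f1, sd_f1 = bootstrap_f1_sd(toy_counts, n_samples=200, seed=1)
```

The standard deviation gives a rough sense of whether the gap between two runs is larger than what resampling the test documents alone would produce.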
• 26 teams (106 runs)
• Highest F-score: 87.39
• Highest precision: 98.05 (recall of 17.90)
• Highest recall: 92.11 (precision of 76.72)
[Figure: results plot with IAA and Baseline marks]
• P: precision; R: recall; F1: F-score; SDs: standard deviation in the bootstrap samples
Abbreviations
Run | Recall
T173_5 | 91.38
T173_3 | 91.33
T219_2 | 91.33
T173_4 | 88.99
T173_1 | 87.68
Family
Run | Recall
T173_5 | 90.06
T219_2 | 89.56
T173_3 | 87.99
T231_5 | 87.63
T231_3 | 86.55
Formula
Run | Recall
T173_3 | 89.37
T173_5 | 88.09
T173_2 | 83.94
T231_5 | 83.82
T173_4 | 83.18
Identifier
Run | Recall
T173_3 | 90.06
T173_4 | 89.86
T173_5 | 87.52
T231_2 | 86.16
T219_2 | 84.79
Systematic
Run | Recall
T182_2 | 94.25
T173_5 | 94.15
T173_3 | 94.03
T173_4 | 92.74
T179_2 | 92.24
Trivial
Run | Recall
T173_3 | 95.89
T173_5 | 95.25
T219_2 | 94.72
T231_5 | 93.10
T173_4 | 92.76
Multiple
Run | Recall
T173_5 | 60.30
T231_3 | 53.27
T173_3 | 53.27
T231_2 | 53.27
T231_5 | 52.76
No class (only 41 cases)
Run | Recall
T173_3 | 83.49
T173_5 | 82.43
T219_2 | 82.05
T173_4 | 78.52
T173_1 | 74.69
• Only 108 of 25,351 mentions not detected by any team (joined recall = 99.57%): for all mentions, not only novel ones
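Joined recall treats a gold mention as found if at least one team detected it. A minimal sketch with toy data follows; the function name and the integer mention identifiers are illustrative assumptions. The last line re-derives the reported percentage from the two counts on the slide.

```python
def joined_recall(gold_mentions, team_predictions):
    """Share of gold mentions detected by at least one team."""
    gold = set(gold_mentions)
    if not gold:
        return 0.0
    detected = set()
    for predictions in team_predictions:
        detected |= gold & set(predictions)
    return len(detected) / len(gold)


toy_recall = joined_recall({1, 2, 3, 4}, [{1, 2}, {2, 3, 9}])

# Reported figures: 108 of 25,351 mentions were missed by every team.
reported = round(100 * (25351 - 108) / 25351, 2)
```

The gap between the joined recall and the best single-run recall suggests that the systems miss different mentions, which fits the later remark that combined systems could improve results.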
[Figure: plot by CEM class, numbered 1 to 8]
Class key: 1- Abbreviations; 2- Family; 3- Formula; 4- Identifier; 5- Multiple; 6- No class; 7- Systematic; 8- Trivial
FORMULA: highly ambiguous one/two letter mentions
PMID | Offset | Mention
23537166 | A:953:954 | I
23591845 | A:760:761 | P
23560542 | A:865:866 | H
23375796 | A:567:568 | O
MULTIPLE: in general more difficult
PMID | Offset | Mention
23414837 | A:152:199 | triazolo and imidazo dihydropyrazolopyrimidines
23414837 | A:24:74 | amido and benzimidazole dihydropyrazolopyrimidines
23375209 | A:1509:1538 | C4/Ci4, C3, C5 acylcarnitines
TRIVIAL: dyes, special morphology names with brackets
PMID | Offset | Mention
23061466 | A:875:889 | Guangfu base A
23223414 | A:497:510 | anatase (101)
23580394 | T:28:41 | Squaraine Dye
23122138 | A:274:286 | Sepharose 4B
SYSTEMATIC: very few missed cases, difficult very long mentions
2R,4R- isomer of 2-hydroxy-2-(indol-3-ylmethyl)-4-aminoglutaric acid
1-piperazineacetamide, N-[5-(aminocarbonyl) tricyclo[3.3.1.13,7]dec-2-yl]-α,α-dimethyl-4-[5(trifluoromethyl)-2-pyridinyl]
• 18 teams had worked before on chemical entity recognition, 9 teams were new
• All teams used the provided training/development data sets
• All except 4 teams used the BioCreative evaluation library
• Only 5 teams used some other additional training data (RSC and legacy dictionaries, DDI corpus, ChEBI patent corpus)
• Lexical resources: all except 4 teams used them (included: PubChem, English dictionary as a negative lexicon, Jochem, ChEBI, DrugBank, CTD, UMLS, MeSH, Wikipedia, ChemSpider, HMDB, GPoSTTL)
• Lexical resource expansion (for synonyms/aliases): 10 teams
• Recognition of other entities: 9 teams (mostly genes or proteins, or generic named entities)
• Integration of previous chemical NER systems: 15 teams
• Used taggers: ChemSpot, Oscar4, LeadMine, ProMiner, Peregrine, CheNER, ICE, OPSIN, ChemAxon, MetaMap, MiniChem/Drug Tagger, PubTator, ChemicalTagger
• 21 teams used some sort of machine learning algorithm: Conditional Random Fields (19), Support Vector Machines, Brown clustering, word embedding induction, Logistic Regression, Max. Entropy, Random Forests
• 5 teams already provide software, 11 teams would be able to provide software, 10 stated that they would maybe be able to provide it
• 23 would participate again, 3 would maybe participate again
Using the framework offered by the CHEMDNER task as a way to:
• Improve/demonstrate the performance of their software
• Push the state of the art for this topic
• Test their tools on a different corpus
• Adapt their system to detect (new) types of chemical entities as defined in the CHEMDNER corpus
• Be able to compare their system to other strategies
• Improve the scalability of their system
• Explore the integration of different NER systems for this task
• First time this task was posed: considerable participation
• A comprehensive set of annotation rules was developed for this task
• A corpus large enough to enable training and testing of systems was constructed
• The results obtained are quite competitive but could still be improved slightly: combined systems & examination of individual methods
• Abstracts contain a valuable source of chemical information to be exploited
• Chemical document indexing is not much better than chemical entity mention recognition
• A considerable number of systems are or will be available
Text mining applications and the extraction of drug-induced adverse effects
• Liver: central organ in toxicology, role in metabolic, excretory & synthetic biochemical pathways
• Direct/indirect toxicity, hypersensitivity, idiosyncratic reactions
• Challenging to predict using standard chemoinformatics
• Drug approval: one of the major causes of drug attrition
• Information in scientific literature, industry reports, institutional reports (EPAR), NDA, patents, clinical records, etc.
• Fourches et al 2010 Chem Res Toxicol: Drug Induced Liver Injuries, extracted and curated proto-assertions composed of concept_relationship_concept triples (BioWisdom Sophia)
• Complement the information extracted from companies' legacy records with data extracted from public documents
• Design novel methodologies to extract toxicology information from documents
• Develop a system to automate the extraction process
• Link the extracted data with other existing resources
• Prepare the extracted data for storage in the ChOX database
[Diagram labels: Public datasets, EPAR, NDA, Literature, CROs, Vitic, Legacy reports, Ontologies, Directly imported, Text mining, ChOX]
Data source         Number of documents
PubMed abstracts    Entire collection
Full texts¹         13,234
EPAR                2,145
NDA                 7,738
                    Jochem            Oscar4       ChemicalTagger
Sentences           1,261,644         1,261,644    1,261,644
Mentions            720,387           2,002,884    1,414,761
Unique              24,067            143,066      143,656
Filtered            22,692            138,267      139,171
SMILES (SD files)   3,361 (14.81%)    -            -
• Three types of compound/drug indexing:
  (a) Dictionary-based (fast look-up, Jochem lexicon)
  (b) Named entity recognition (NER, based on ML: ChemSpot, OSCAR4)
  (c) Meta-data (manually indexed MeSH substance terms)
• Additional dictionary pruning
• Implementation of ChemNER (CRF-based, on the Kolarik corpus)
• Issues with name to structure conversion: ACD, CambridgeSoft (PerkinElmer) and LexiChem (OpenEye)
• CambridgeSoft: Name2Structure batch mode (evaluated several options)
• Comment: the text mining community demands a kind of community competition for chemical NER (sort of CHEMDNER)
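The dictionary-based option (a), together with the dictionary pruning step, can be sketched as follows. This is only an illustrative sketch: the lexicon entries, the stop list and the function names are hypothetical stand-ins, not the actual Jochem lexicon or the pipeline used in the project.

```python
import re

# Toy lexicon standing in for a chemical dictionary such as Jochem
# (hypothetical entries, for illustration only).
LEXICON = {"acetaminophen", "valproic acid", "isoniazid", "lead"}
# Hypothetical pruning list: ambiguous entries removed before look-up.
STOPLIST = {"lead"}


def build_pattern(terms):
    """Compile one alternation, longest names first so multiword names win."""
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # Lookarounds prevent matches inside longer alphanumeric tokens.
    return re.compile(
        r"(?<![A-Za-z0-9])(?:" + alternation + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )


def tag_chemicals(text, lexicon=LEXICON, stoplist=STOPLIST):
    """Return (start, end, mention) tuples for pruned-lexicon matches."""
    terms = set(lexicon) - set(stoplist)
    if not terms:
        return []
    pattern = build_pattern(terms)
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


if __name__ == "__main__":
    for hit in tag_chemicals("Valproic acid and isoniazid may cause liver injury."):
        print(hit)
```

A plain look-up like this is fast but cannot find names missing from the lexicon, which is one reason to combine it with the ML-based taggers listed in (b).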
- If a molecule has several fragments, the major one is kept
- Standardization of chirality and charge
- Removal of structures with incorrect isotopic information
- Acids are protonated
- Bases are deprotonated
Collab. with Obdulia Rabal (Universidad de Navarra)
             Abstracts          Full text         EPAR
Sentences    116,352            63,391            2,230
Mentions     163,223            129,122           7,525
Ment_Look    146,224            61,822            3,369
Ment_Rule    16,999             58,612            4,156
Unique       1,997              1,459             125
Sent_CC      36,145             14,013            1,188
Ment_CC      56,801             18,432            2,443
Unique_CC    5,420              1,982             335
Name to structure conversion: canonical SMILES (& SDF)
             25,179 (44.33%)    8,211 (44.55%)    507 (20.75%)

CHID = ChemIDplus, CHEB = ChEBI, CAS = CAS number, PUBC = PubChem compound, INCH = InChI string, DRUG = DrugBank

Relation types: Inhibition, Substrate, Inducer
Counts (in slide order): 58,560 (73 triggers); 87,094 (378 triggers); 72,169 (82 triggers)
CYPs: normalized to UniProt accession & CYP nomenclature codes
Overall combined (lookup, rule, all sets): 299,869
- Lexicon of 243,667 mention variants/aliases (disambiguation of ambiguous acronym mentions)
             Abstracts          Full text         EPAR
Sentences    430,876            65,190            2,671
Mentions     592,588            780,630           4,285
Ment_Look    -                  -                 -
Ment_Rule    -                  -                 -
Unique       644                280               93
Sent_CC      99,220             13,470            754
Ment_CC      138,404            18,034            1,059
Unique_CC    13,648             3,193             327
Name to structure conversion: canonical SMILES (& SDF)
             57,456 (41.51%)    5,699 (31.60%)    103 (9.73%)

CHID = ChemIDplus, CHEB = ChEBI, CAS = CAS number, PUBC = PubChem compound, INCH = InChI string, DRUG = DrugBank

Relation types: Increase, Decrease, Change
Counts (in slide order): 198,572; 163,859; 65,237
Combined filtered: 674,936
General aspects of Tebacten
• Identification of articles relevant to bacterial metabolism
• Detection of the bio-entities involved in biochemical reactions: enzymes, compounds and organisms/species
• Extraction of weighted (ranked) relationships between these bio-entities
• An interface to browse this information
• Construction of a manually curated database of metabolic reactions from the literature
• The option to normalize/ground bio-entity mentions to other knowledgebases like UniProt and ChEBI
Main relation types
Structuring the Biodegradation bibliome using taxonomies
Recognition of bacterial gene and protein mentions
Integrated:
- Machine learning NER
- Dictionary lookup (UniProt)
- EC code (pattern)
- Gene symbol (rule-based)
Should also add metadata from PubMed: enzymes
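The pattern-based EC code component can be illustrated with a regular expression. The sketch below is an assumption about what such a pattern might look like, not the expression used in Tebacten; `EC_PATTERN` and `find_ec_codes` are hypothetical names. It accepts four dot-separated fields after the "EC" prefix and allows "-" for unspecified lower levels.

```python
import re

# Four dot-separated fields after "EC"; the last three may be "-"
# when the classification is only partially specified (e.g. EC 3.1.-.-).
EC_PATTERN = re.compile(r"\bEC[\s:]?\s*(\d+)\.(\d+|-)\.(\d+|-)\.(\d+|-)")


def find_ec_codes(text):
    """Return the EC numbers mentioned in text, without the 'EC' prefix."""
    return [".".join(m.groups()) for m in EC_PATTERN.finditer(text)]


if __name__ == "__main__":
    sentence = ("Toluene dioxygenase (EC 1.14.12.11) and a hydrolase "
                "(EC 3.1.-.-) were detected.")
    print(find_ec_codes(sentence))
```

Because EC numbers follow a fixed syntax, a pattern like this tends to be precise, and it complements the dictionary and machine learning components, which cover enzyme names rather than codes.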
Extraction patterns
Extraction rules
Pseudomonas putida KT2440 bibliome
http://tebacten.bioinfo.cnio.es/
Construction of literature derived mutations sets
A. Valencia, Florian Leitner, Miguel Vazquez
Spanish National Cancer Research Center, Madrid, Spain
Martin Krallinger
Kamalakannan A. R. & Ashish V Tendulkar.
Department of Computer Science and Engineering, Indian Institute of Technology, Madras, India
Motivation behind species tagging / importance
• Species-specific document classification/retrieval
• Essential for correct gene/protein mention normalization (linking bio-entity mentions to database identifiers)
• Biodiversity informatics: create a comprehensive catalogue of all described species together with literature pointers
• Associations of pathogens (viruses & bacteria) with cancer:
  - Cervical cancer & human papilloma virus (HPV)
  - Primary liver cancer & Hepatitis B and C viruses
  - Lymphomas & Epstein-Barr Virus
  - T cell leukaemia in adults & the Human T cell leukaemia virus
  - HPV and oropharyngeal cancer & non-melanoma skin cancers
  - Helicobacter pylori and stomach cancer
• Harald zur Hausen (Nobel 2008, HPV & cervical cancer)
Text mining for pathogen – cancer type prioritization
BIOINFORMATICS, Vol. 00 no. 00 2011, Pages ???
ORNATA: An Organism Name Tagger Based on Conditional Random Fields
Kamalakannan A. R. (1), Martin Krallinger (2), Ashish V Tendulkar (1,*)
(1) Department of Computer Science and Engineering, Indian Institute of Technology, Madras, India
(2) Spanish National Cancer Research Center, Madrid, Spain
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Associate Editor: XXXXXXX
ABSTRACT
Motivation: There has been an increasing interest in building accurate systems for the task of identifying organism names in biomedical literature. It has been acknowledged that this is essential for advanced text mining tasks, like recognition and disambiguation of biomedical entities such as proteins and genes, and IE systems for extraction of protein-protein interactions. Organism name recognition is directly necessary for species-specific document classification and retrieval. Also, it is of great importance in the emerging field of biodiversity informatics, which aims to create comprehensive databases of all organisms.
Results: This work explores a method based on Conditional Random Fields (CRFs) using rich feature sets for organism name recognition. We conducted various experiments for selecting feature sets and labelling schemes appropriate for this particular task. We achieved F1 values of 74.60% on the GENIA corpus and 74.37% on the LINNAEUS corpus. We compared its performance with the LINNAEUS species name identification system and concluded that our method achieves competitive performance with the state of the art at a fraction of the computational cost. We also present a detailed analysis of the outputs from both of these systems.
Availability: The organism name tagger is freely available at http://www.cse.iitm.ac.in/~ashishvt/research/ornata/ and supported in Linux. It comes with four models trained on two standard corpora. The GUI version can be used for visualization of the tagging outputs, training customized models and analyzing testing results.
Contact: [email protected]
1 INTRODUCTION
The exponential increase in the volume of available biomedical literature has spurred a lot of interest in developing efficient techniques for biomedical text mining. Information Retrieval (IR) tools help biologists by retrieving relevant articles in response to search queries. Information Extraction (IE) methods can automatically extract information like protein-protein interactions and biomolecular events (??). These help in phenomenally reducing manual effort in curation tasks, like building databases of proteins, genes, diseases, drugs and biological pathways (????). A basic step in both IE and advanced IR systems is to identify biological entities like proteins, genes and cell lines from text, a task known as Named Entity Recognition (NER) or simply Entity Recognition (ER). In spite of a lot of attention, NER techniques perform far worse on biomedical texts than on normal texts, like news articles, due to a variety of well known factors (?).

In this work, we specifically focus on NER of organism names from research articles. It is an important task in the text mining pipeline for numerous reasons. Organism NER can help in recognition and disambiguation of other biomedical entities, like genes and proteins, since they are typically mentioned along with their host species (??). Also, many teams which participated in the BioCreative II challenge for extraction of protein-protein interactions (PPI) and the BioNLP shared task for biomolecular event extraction emphasized the importance of recognizing species names (??).

Organism name recognition can also help in “taxonomically intelligent” IR to limit searches within the specified taxa (or even “explode” them to include all subordinate taxa). For example, if the word “mammal” is present in the search phrase, articles that mention mammal names like “human” or “mice” can also be retrieved. PathBinderH is a tool that implements this for plant taxa (?). A similar project is uBioRSS, which aims to provide RSS feeds narrowed to the researcher’s taxa of choice (?). In the popular PubMed search, this can be achieved to some extent with the help of a controlled vocabulary called Medical Subject Headings (MeSH). MeSH is a carefully selected list of 25,588 descriptors (including organism names), arranged in a hierarchy, used to manually (or semi-automatically) tag biomedical articles. Entry terms or synonyms (172,000 in total) are also used and searches can be optionally exploded to include subcategories [1]. But the scope is limited since the set of organism names in MeSH is comparatively small, perhaps because of curation overheads.

Biodiversity informatics aims to automatically build databases containing information on morphology, distribution and phylogeny of organisms and higher order taxa, to aid in studying the evolution, speciation and diversification of organisms across the world (?). Here, the processing of recently opened up historical archives poses many challenges, since they describe organisms which may have been renamed, migrated or become extinct. The Encyclopedia Of Life (EOL) project aims to create an authoritative webpage for each species known to mankind (around 2 million). The indispensability of automatic organism name identification from text in such ambitious efforts is obvious. Also, it will be useful in systems like BioLit, which are aimed at semantic enrichment of articles by marking up entities in them for integration with standard databases (??).

* To whom correspondence should be addressed.
[1] http://www.nlm.nih.gov/bsd/disted/pubmedtutorial/
© Oxford University Press 2011.
Challenges
- Large collection of species, changes over time, hierarchical relation types
- Homonymy with commonly used English words, e.g. “Spot” (Leiostomus xanthurus) and “Permit” (Trachinotus falcatus)
- Homonymy with other biomedical entities too (the word “goat” can refer to proteins found in human, zebrafish, rat and mouse)
- Abbreviations are ambiguous, e.g. HBV can be used for both “Hepatitis B virus” and “Hepatitis B vaccine”
- Vernacular forms (common names)
- Incorrect case or misspellings (like Bacterium coli, Bacillus coli and Escheria coli for Escherichia coli)
- Coordinations and nested expressions: “human immunodeficiency viruses types 1 and 2” refers to two distinct species names, “HIV type 1” and “HIV type 2”
- Role names (e.g. athletes, responders)
Types of systems
(I) Dictionary-based systems: based on a lexicon of organism names (NCBI Taxonomy), e.g. Kappeler et al (2009) as a base for later PPI extraction; LINNAEUS uses the NCBI dictionary and stop word filtering plus rule-based post-processing (Gerner et al 2010).
(II) Rule-based systems: capture regularities in words, like orthographic features, with Linnaean conventions encoded into rules. Examples: TaxonGrab (rules and lexicon); Find All Taxon names (FAT), rules and n-gram distributions (Sautter et al 2006); FindIT (uBio), additional rules involving affixes and lexicons.
(III) Machine learning based systems: e.g. Wang and Grover 2008 used a Maximum Entropy model to tag species name mentions associated with entities of interest; OrganismTagger (SVM built in GATE).
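As an illustration of type (II), a Linnaean binomial rule combined with a small negative lexicon of common English words might look like the sketch below. It is not the rule set of TaxonGrab, FAT or FindIT; the pattern, the word list and the function name are assumptions made for this example.

```python
import re

# Capitalized genus (or an abbreviated one such as "E.") followed by a
# lowercase species epithet. A full genus must be followed by whitespace;
# an abbreviated genus may be glued to the epithet ("C.elegans").
BINOMIAL = re.compile(r"\b([A-Z][a-z]{2,}|[A-Z]\.)(?:\s|(?<=\.))([a-z]{3,})\b")

# Toy negative lexicon of common English words (hypothetical, very incomplete).
COMMON_WORDS = {"the", "this", "these", "patient", "patients", "was", "were",
                "and", "with", "from", "results", "show"}


def candidate_species(text):
    """Return binomial-looking candidates not rejected by the negative lexicon."""
    hits = []
    for m in BINOMIAL.finditer(text):
        genus, epithet = m.group(1), m.group(2)
        if genus.rstrip(".").lower() in COMMON_WORDS or epithet in COMMON_WORDS:
            continue
        hits.append(f"{genus} {epithet}")
    return hits


if __name__ == "__main__":
    text = ("Escherichia coli and P. putida were grown; "
            "C.elegans was not. The patient recovered.")
    print(candidate_species(text))
```

A rule like this only finds scientific names. Vernacular names such as "mice" or "rice" need a dictionary or a learned model, which is why the systems above usually mix rules with lexicons.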
NCBI Taxonomy
Efforts to catalogue life / species / taxonomies
• Thomson Scientific has a list of over 3 million names
• World Register of Marine Species
• Global Names Index database
• Catalogue of Life
• Species 2000
• NEWT (UniProt)
• ZipcodeZoo has over 1.4 million species names
• List of Prokaryotic Names with Standing in Nomenclature
• ITIS Catalogue of Life 2007 Annual Checklist
Ornata overview
• Conditional Random Field (CRF), a state-of-the-art graphical model, for organism NER
• Features derived from the words mentioning organisms, their context, semantic tags and part-of-speech tags
• NER as a sequence labeling task: sentences are the sequences, words are the tokens and entity classes are the labels
• Linear-chain CRFs (SimpleTagger) in the Java-based package MALLET
• Basic feature set: vocabulary from the training set and 14 binary orthographic features, a CONTAINS-GREEK feature to model Greek letters and a STOPWORD feature
• Context features: previous and next words (w_{t-1} and w_{t+1}) of a word w_t (numerals are another distinct feature)
• Other features: POS tags, word class, semantic tags
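The basic and context features can be sketched as a per-token feature function. The slide does not list the 14 orthographic features, so the ones below (INIT-CAP, ALL-CAPS, HAS-DIGIT, HAS-HYPHEN) are an assumed, illustrative subset. The Greek word list, the stop word list and all function names are hypothetical as well. The output line format follows the usual convention of one token per line with the label last.

```python
# Illustrative word lists; the actual ORNATA resources are not given in the slides.
GREEK_NAMES = {"alpha", "beta", "gamma", "delta", "kappa", "lambda"}
STOPWORDS = {"the", "of", "in", "and", "a", "with"}


def token_features(tokens, t):
    """Binary features for token w_t plus its neighbours w_{t-1} and w_{t+1}."""
    w = tokens[t]
    feats = ["W=" + w.lower()]
    if w[:1].isupper():
        feats.append("INIT-CAP")
    if len(w) > 1 and w.isupper():
        feats.append("ALL-CAPS")
    if any(c.isdigit() for c in w):
        feats.append("HAS-DIGIT")
    if "-" in w:
        feats.append("HAS-HYPHEN")
    # Greek either spelled out ("beta") or as a Greek lowercase letter.
    if any(g in w.lower() for g in GREEK_NAMES) or any("α" <= c <= "ω" for c in w):
        feats.append("CONTAINS-GREEK")
    if w.lower() in STOPWORDS:
        feats.append("STOPWORD")
    prev_word = tokens[t - 1].lower() if t > 0 else "<S>"
    next_word = tokens[t + 1].lower() if t + 1 < len(tokens) else "</S>"
    feats.append("PREV=" + prev_word)
    feats.append("NEXT=" + next_word)
    return feats


def to_training_lines(tokens, labels):
    """One line per token: space-separated features followed by the label."""
    return [" ".join(token_features(tokens, t) + [labels[t]])
            for t in range(len(tokens))]


if __name__ == "__main__":
    toks = ["Infection", "with", "HIV-1", "in", "humans"]
    labs = ["O", "O", "ORG", "O", "ORG"]
    print("\n".join(to_training_lines(toks, labs)))
```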
Corpora used
GENIA corpus: semantically annotated collection of 2,000 MEDLINE abstracts. 6,152 words (1.29% of the total) were part of organism name mentions.
LINNAEUS corpus: corpus specifically annotated for species names, consisting of 100 randomly selected full-text articles from PubMed Central.
- Species name mentions annotated with descriptors like misspelt by author, in incorrect case and misspelt due to OCR errors.
- Normalized to NCBI Taxonomy IDs.
- 19,044 sentences with 504,330 words (26.48 words per sentence).
- 4,092 words (0.81%) were part of organism names in the filtered tags dataset and 5,500 words (1.09%) in the all tags dataset.
Experimental setting and evaluation
10-fold cross validation
Feature sets used:
• Basic feature set
• Basic + Context tags
• Basic + POS tags
• Basic + Word class (long) tags
• Basic + Word class (brief) tags
• Basic + Semantic tags (GENIA dataset alone)
• Basic + Context + POS + Word class (brief) tags
• BIO representation scheme with Basic + Context + POS + Word class (brief) tags
BIO scheme: labels the words as belonging to the Beginning, the Inside or the Outside of organism names.
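A minimal sketch of the BIO scheme, assuming annotations are given as token index spans with an exclusive end. The function names and the span format are choices made for this example, not taken from ORNATA.

```python
def to_bio(tokens, spans):
    """Label tokens as B (beginning), I (inside) or O (outside) of a name.

    spans: list of (start, end) token indices, end exclusive.
    """
    labels = ["O"] * len(tokens)
    for start, end in spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


def from_bio(labels):
    """Recover (start, end) spans from a BIO label sequence."""
    spans, start = [], None
    # A trailing "O" sentinel closes a span that runs to the end.
    for i, label in enumerate(list(labels) + ["O"]):
        if label != "I" and start is not None:
            spans.append((start, i))
            start = None
        if label == "B":
            start = i
    return spans


if __name__ == "__main__":
    toks = ["Infection", "by", "Escherichia", "coli", "in", "mice"]
    labs = to_bio(toks, [(2, 4), (5, 6)])
    print(list(zip(toks, labs)))
    print(from_bio(labs))
```

Compared with a plain inside/outside labelling, the separate B label lets the tagger tell apart two organism names that are directly adjacent in the text.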
Analysis and classification of errors (LINNAEUS corpus)
• All classification errors of the best performing models were manually revised
• Words adjacent to correct classifications: examples are “Plasmids” in “Plasmids, yeast”, “Guiyu” in “Guiyu patient”, “mitochondria” in “wheat mitochondria” and “electroporate” in “electroporate potato”
• Ambiguous words: “travelers who fly”, “Child”, and “V. B. Vouk” in an author name list. “Rice”, a false positive of the LINNAEUS system, was handled correctly by ORNATA
• Detected missing annotations in the LINNAEUS corpus: “Bortrytis blight”, “adenovirus”, “murine”, “women”
• The tagger missed many long organism names like “Pileated Woodpecker Dryocopus pileatus”, “Ivory-billed Woodpecker Campehilus principalis” and “Viral Hemorrhagic Septicemia Virus (VHSV)”
• It also missed some names in the scientific form like “L.whitmani” and “C.elegans”
Analysis and classification of errors (GENIA corpus)
• False positives: words near organism names that shared the same orthographic features and context. Examples: “HIV pathogenesis”, “EBV lytic origin”, “HTLV-1 tat”
• False negatives: words whose orthographic features would not have provided any clue (like “sooty mangbeys”, “housedust mites”, “simian virus 40” and “primate lentiviruses”) or would have acted against them (“iHIV-1” as an abbreviation of “heat-inactivated HIV-1”)
Generating text annotations
http://myminer.armi.monash.edu.au
MyMiner for manual text classification
MyMiner for entity tagging
CHEMDNER
CENTRO NACIONAL DE INVESTIGACIONES ONCOLOGICAS
• Florian Leitner, Miguel Vazquez and Alfonso Valencia (CNIO, Spain)
• Julen Oyarzabal and Obdulia Rabal (Small Molecule Discovery Platform, Center for Applied Medical Research, University of Navarra, Spain)
• David Salgado (Medical Genomics and Functional Genetics Institute at the Aix-Marseille University in Marseille, France)
• Ashish V Tendulkar and Kamalakannan A. R. (Department of Computer Science and Engineering, Indian Institute of Technology, Madras, India)