Board of Scientific Counselors April 8, 2010
Quality Assurance in Biomedical Terminologies and Ontologies
Olivier Bodenreider, MD, PhD Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA
Outline Analytical
framework for quality assurance
research Overview of 36 auditing studies Detailed presentation of 4 studies
Lister Hill National Center for Biomedical Communications
2
Medical Ontology Research Objective:
To develop methods whereby ontologies can be acquired from existing resources and validated against other knowledge sources, including the Unified Medical Language System (UMLS) Quality assurance
In and of itself As part of other projects (e.g., alignment)
Lister Hill National Center for Biomedical Communications
3
Motivation To
improve the quality of biomedical terminologies and ontologies Approach
Principled Automated Scalable
Report
to the developers Share our methods
Lister Hill National Center for Biomedical Communications
4
Ontology vs. other artifacts Ontology
Defining types of things and their relations
Terminology
Naming things in a domain
Thesaurus
Organizing things for a given purpose
Classification
Placing things into (arbitrary) classes
Knowledge
bases
Assertional vs. definitional knowledge Lister Hill National Center for Biomedical Communications
5
Ontology vs. other artifacts (revisited) Lexical
and terminological resources
Mostly collections of names for biomedical entities Often have some kind or hierarchical organization (e.g., relations)
Ontological resources
Mostly collections of relations among biomedical entities Sometimes also collect names
In practice: “Terminologies and Ontologies” Lister Hill National Center for Biomedical Communications
6
Analytical framework for quality assurance research
Analytical framework for QA research Special
issue of JBI on “Auditing terminologies” Zhu et al. JBI 2009 review article Analytical framework
What is analyzed Which source of knowledge Which method
Zhu, X., J.W. Fan, D.M. Baorto, C. Weng, and J.J. Cimino, A review of auditing methods applied to the content of controlled biomedical terminologies. J Biomed Inform, 2009. 42(3): p. 413-25.
Lister Hill National Center for Biomedical Communications
8
What is analyzed Term/concept
Coverage (missing terms/concepts) Wrong synonymy relation Redundant concepts
Relation
Missing relations Inaccurate relations
Categorization
Wrong categorization Lister Hill National Center for Biomedical Communications
9
Which source of knowledge Intrinsic –
the terminology itself
Terms/Concepts Relations Categorization
Extrinsic –
Corpus
external resources
Text corpus – identify terms in text Annotation corpus – identify relations from co-occurring terms
Mapping
Lister Hill National Center for Biomedical Communications
10
Which method Main categories Lexical
Properties of the term
Structural
Properties of the organizational structure (relations)
Semantic
Semantic properties of the concept (semantic type)
Ischemic enteritis Acute Ischemic enteritis Chronic Ischemic enteritis
Modifier
Head
Statistical
Associations among entities Lister Hill National Center for Biomedical Communications
11
Which method Main categories Lexical
Properties of the term
Structural
Properties of the organizational structure (relations)
Semantic
Semantic properties of the concept (semantic type)
Statistical
Disease
Associations among entities
Endocrine system diseases Adrenal gland diseases Adrenal gland hypofunction
Adrenal cortex diseases
Adrenal cortical hypofunction Addison’s Disease
Lister Hill National Center for Biomedical Communications
12
Which method Main categories Lexical
Properties of the term
Structural
Properties of the organizational structure (relations)
Semantic
Semantic properties of the concept (semantic type)
Semantic Network Body Part, Organ, or Organ Component
location of Disease or Syndrome
Adrenal Cortex
Adrenal Cortical hypofunction
location of Metathesaurus
Statistical
Associations among entities Lister Hill National Center for Biomedical Communications
13
Which method Main categories Lexical
Properties of the term
Structural
Properties of the organizational structure (relations)
gene1 – term1 gene1 – term2 gene2 – term1 gene2 – term2 gene3 – term1 gene3 – term3
Semantic
Semantic properties of the concept (semantic type)
Statistical
term1
term3
term2
Associations among entities Lister Hill National Center for Biomedical Communications
14
Which method Additional methods Compliance
with ontological
principles
Operational definitions
Comparative
Comparisons between ontologies (mapping)
“Each concept, except for the root, must have (at least) one parent concept”
Transformative
Representation formalism
Use
in an application
Lister Hill National Center for Biomedical Communications
15
Which method Additional methods Compliance
with ontological
principles
Operational definitions
MedDRA
SNOMED CT
Comparative
Comparisons between ontologies (mapping)
Transformative
Representation formalism
Use
in an application
Lister Hill National Center for Biomedical Communications
16
Which method Additional methods Compliance
with ontological
principles
Operational definitions
Comparative
Comparisons between ontologies (mapping)
Transformative
Representation formalism
Use
in an application
Lister Hill National Center for Biomedical Communications
17
Which method Additional methods Compliance
with ontological
principles
MAOUSSC: Using UMLS for the description of medical procedures
Operational definitions
Comparative
Comparisons between ontologies (mapping)
Transformative
Representation formalism
Use
in an application
Lister Hill National Center for Biomedical Communications
18
Overview of 36 research studies
Overview 1998-2010 36
research studies on QA 18 as primary focus 18 as an application
13
different terminologies and ontologies Using a wide range of methods Presented
through the analytic framework
Lister Hill National Center for Biomedical Communications
20
Lister Hill National Center for Biomedical Communications
21
Year of publication 7 6 5 4 3
QA
2 1 0
Lister Hill National Center for Biomedical Communications
22
Year of publication 25 20 15 total 10
QA
5
28% 0
Lister Hill National Center for Biomedical Communications
23
Terminology/Ontology audited FMA GALEN Gene Ontology ICD-10 ICD-O LOINC MedDRA NCI Thesaurus RxNorm SNOMED CT UMLS Meta UMLS SemNet WordNet 0
5
10
15
Lister Hill National Center for Biomedical Communications
20 24
What is audited terms/concepts; relations;… terms/concepts; relations terms/concepts sem_types; relations relationships relations; categorization relations 0
2
4
6
8
Lister Hill National Center for Biomedical Communications
10
12 25
Knowledge sources used text corpus terms/concepts; relations;… terms/concepts; relations terms/concepts sem_types; relations relations; concept properties relations; categorization relations mapping; relations corpus of annotations co-occurences 0
2
4
6
8
Lister Hill National Center for Biomedical Communications
10 26
12
Methods used for auditing structural; semantic statistical semantic ontological principles lexical; statistical comparative application 0
1
2
3
4
5
Lister Hill National Center for Biomedical Communications
6 27
7
Four examples of quality assurance research studies
Four studies Identifying polysemous
concepts in the UMLS Identifying errors in RxNorm Non-lexical approaches to identifying relations in the Gene Ontology Lexical approaches to assessing the consistency of relations in SNOMED
Lister Hill National Center for Biomedical Communications
29
Identifying polysemous concepts in the UMLS Erdogan, H., E. Erdem, and O. Bodenreider, Exploiting UMLS semantics for quality assurance purposes. Medinfo, 2010 (in press).
Motivation One
error discovered in the UMLS
Inconsistent semantic group for the parent concept compared to the child concept
Systematic
investigation
Semantic
properties of the concepts at both ends of a hierarchical relation
Lister Hill National Center for Biomedical Communications
31
Methods Check
Semantic groups
Pairs
the semantic consistency
of hierarchically related concepts
2M concepts in the UMLS
Answer
Set Programming
Lister Hill National Center for Biomedical Communications
32
Results 81,512
inconsistent concepts Most inconsistencies are not indicative of any errors
Navigation vs. inference
Small
number of “real”errors
Polysemous concepts (2 meanings in the same concept) = false synonymy Lexically-suggested, but not caught at editing time
Pattern
for wrong synonymy
Lister Hill National Center for Biomedical Communications
33
Example Anatomy
Drug
Membranous layer
Pill
Capsule
Capsule of adrenal gland
Oral capsule
Lister Hill National Center for Biomedical Communications
34
Identifying errors in RxNorm Bodenreider, O. and L.B. Peters, A graph-based approach to auditing RxNorm. Journal of Biomedical Informatics, 2009. 42(3): p. 558-570.
Motivation Large
terminology Relies heavily on human editors High quality Systematic
evaluation Exploiting the graph structure
Lister Hill National Center for Biomedical Communications
36
tradename_of
ingredient_of
has_ingredient
has_ingredient
has_ingredient
has_tradename
Brand name (BN)
ingredient_of
ingredient_of
Ingredient (IN)
tradename_of
consists_of
consists_of constitutes
tradename_of
consists_of
Clinical drug (SCD)
has_tradename
Branded drug component (SBDC)
constitutes
constitutes
Clinical drug component (SCDC)
Branded drug (SBD)
inverse_isa
isa
has_ingredient
ingredient_of
has_tradename
Branded drug form (SBDF)
has_ingredient
tradename_of
inverse_isa
Clinical drug form (SCDF)
isa
ingredient_of
has_tradename
Methods Normalize
multi-ingredient drugs Define “meaningful” paths between 2 nodes Cetirizine has_ingredient consists_of
constitutes
ingredient_of
consists_of constitutes
tradename_of
Branded drug component (SBDC)
constitutes
Clinical drug (SCD)
tradename_of has_tradename
consists_of
Branded drug (SBD)
Zyrtec 10 MG Oral Tablet
inverse_isa
isa
ingredient_of
Lister Hill National Center for Biomedical Communications
Branded drug form (SBDF)
38
ingredient_of
tradename_of has_tradename
has_ingredient
Clinical drug form (SCDF)
isa
has_ingredient
has_tradename
inverse_isa
Alternate (meaningful) paths are expected to be functionally equivalent
Zyrtec
Brand name (BN)
has_ingredient
Clinical drug component (SCDC)
tradename_of has_tradename
ingredient_of
ingredient_of
Ingredient (IN)
has_ingredient
meaningful paths Cetirizine Compare alternate paths Cetirizine 10 MG Instantiate all
Zyrtec
Results 348
inconsistencies identified (April 2008) Reported to the RxNorm team 215 (62%) fixed (January 2009) Sodium chloride
Sochlor tradename_of
has_ingredient consists_of
constitutes
ingredient_of
has_ingredient
tradename_of has_tradename consists_of constitutes
tradename_of
Branded drug component (SBDC)
constitutes
Clinical drug (SCD)
Brand name (BN)
consists_of
Branded drug (SBD)
Sochlor 0.854 MEQ/ML Ophthalmic Solution
inverse_isa
isa
ingredient_of
Lister Hill National Center for Biomedical Communications
Branded drug form (SBDF)
39
ingredient_of
tradename_of has_tradename
has_ingredient
Clinical drug form (SCDF)
isa
has_ingredient
has_tradename
inverse_isa
missing link Sochlor → Sodium chloride Brand name without a direct relation with an ingredient
Sodium Chloride 0.854 MEQ/ML Clinical drug component (SCDC)
Sochlor ingredient_of
(fixed)
has_tradename
has_ingredient
Example
ingredient_of
Ingredient (IN)
Non-lexical approaches to identifying relations in the Gene Ontology Bodenreider, O., M. Aubry, and A. Burgun, Non-lexical approaches to identifying associative relations in the Gene Ontology, in Pacific Symposium on Biocomputing 2005, R.B. Altman, et al., Editors. 2005, World Scientific. p. 91-102.
Motivation Gene
ontology (GO)
Widely used for the functional annotation of gene products in many model organisms 3 separate hierarchies No relations across hierarchies Molecular functions
Cellular components
Biological processes
Missing relations Missing annotations
Lister Hill National Center for Biomedical Communications
41
Methods Lexical
methods had been exploited already Non-lexical methods
Vector space model (information retrieval model) Co-occurrence of annotations Association rule mining
Lister Hill National Center for Biomedical Communications
42
Similarity in the vector space model GO terms …
tn g1 g2
g1 g2 … gn
Annotation database
… gn
t1 t2
GO terms
Genes
t1 t2
Genes
…
tn
Lister Hill National Center for Biomedical Communications
43
Similarity in the vector space model Genes g1 g2
…
tn
ti
→ →
Sim(ti,tj) = ti · tj
tj
GO terms
Similarity matrix
t1 t2
GO terms
GO terms
t1 t2
… gn
…
tn
t1 t2 …
Lister Hill National Center for Biomedical Communications tn
44
Results Vector Space Model VSM LEX Mol Fnc-Cell Comp 499 917 Mol Fnc-Biol Proc 3057 2523 Cell Comp-Biol Proc 760 2053 Total 4316 5493 Mol Fnc: Biol Proc:
ice binding response to freezing
Non-lex Lexical 7665 5493 230
Lister Hill National Center for Biomedical Communications
45
Lexical approaches to assessing the consistency of relations in SNOMED Bodenreider, O., A. Burgun, and T.C. Rindflesch, Assessing the consistency of a biomedical terminology through lexical knowledge. International Journal of Medical Informatics, 2002. 67(1-3): p. 85-95.
Motivation Detect
potential inconsistencies in SNOMED Based on adjectival modification
Frequent construct in medical terms Ischemic enteritis Chronic Ischemic enteritis
Chronic
Acute
Ischemic enteritis Chronic Ischemic enteritis
Acute Ischemic enteritis
Lister Hill National Center for Biomedical Communications
47
Methods Identifying adjectival
modifiers Adjectives and their contexts Co-occurrence of modifiers Generating new modified terms Relationship among terms associated in a pair (acute, chronic) (unilateral, bilateral) (primary, secondary) (acquired, congenital) Lister Hill National Center for Biomedical Communications
48
Results Acquired/congenital (SNOMED) keratodermia Context term present (31%) X 4% Acquired only (10%)
1% Acquired X
Congenital X
Congenital only (84%)
Both present (5%) acquired keratodermia
congenital keratodermia
Lister Hill National Center for Biomedical Communications
49
Summary Investigated quality
assurance in terminologies
Multiple approaches Multiple terminologies
Identified
a limited number of errors that had defeated the quality assurance mechanisms in place in terminology development systems Reported to the developers
Fixed in some cases (FMA, RxNorm)
Shared
our methods with the scientific community
Lister Hill National Center for Biomedical Communications
50
Future work Develop
the use of Semantic Web technologies (RDF / SPARQL and OWL) to support quality assurance Quality assurance “in action”
Investigate terminologies of clinical usefulness, (e.g., NDF-RT – clinical information about drugs) Evaluate its capacity to support clinical decision
Improve
the quality of SNOMED CT through our involvement with the IHTSDO Lister Hill National Center for Biomedical Communications
51
Acknowledgments
NLM
U. Utah
Barry Smith Lowell Vizenor Joyce Mitchell
Erik van Mulligen Lazlo van den Hoek
PR China
Stefan Schulz Elena Beisswanger
Netherlands
Duo Wei
Fleur Mougin Anita Burgun Anand Kumar Christine Golbreich Marc Aubry Genieve Botti Marius Fieschi Pierre Le Beux François Kohler
Germany
NJIT
France
Alexa McCray
U. Buffalo
Lee Peters Kin Wah Fung Lan Aronson Bill Hole Suresh Srinivasan Tom Rindflesch
Harvard
Songmao Zhang
Turkey
Halit Erdogan Esra Erdem
Lister Hill National Center for Biomedical Communications
52
Medical Ontology Research Contact:
[email protected] Web: mor.nlm.nih.gov Olivier Bodenreider, MD, PhD Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA