Quality Assurance in Biomedical Terminologies ... - Semantic Scholar

4 downloads 471 Views 4MB Size Report
Apr 8, 2010 - (SBDC). Branded drug. (SBD). Clinical drug. (SCD) ingredient_of ingredient_of ingredient_of has. _ingredient has. _ingredient has. _ingredient.
Board of Scientific Counselors April 8, 2010

Quality Assurance in Biomedical Terminologies and Ontologies

Olivier Bodenreider, MD, PhD Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA

Outline  Analytical

framework for quality assurance

research  Overview of 36 auditing studies  Detailed presentation of 4 studies

Lister Hill National Center for Biomedical Communications

2

Medical Ontology Research  Objective:

To develop methods whereby ontologies can be acquired from existing resources and validated against other knowledge sources, including the Unified Medical Language System (UMLS)  Quality assurance  

In and of itself As part of other projects (e.g., alignment)

Lister Hill National Center for Biomedical Communications

3

Motivation  To

improve the quality of biomedical terminologies and ontologies  Approach   

Principled Automated Scalable

 Report

to the developers  Share our methods

Lister Hill National Center for Biomedical Communications

4

Ontology vs. other artifacts  Ontology 

Defining types of things and their relations

 Terminology 

Naming things in a domain

 Thesaurus 

Organizing things for a given purpose

 Classification 

Placing things into (arbitrary) classes

 Knowledge 

bases

Assertional vs. definitional knowledge Lister Hill National Center for Biomedical Communications

5

Ontology vs. other artifacts (revisited)  Lexical  

and terminological resources

Mostly collections of names for biomedical entities Often have some kind or hierarchical organization (e.g., relations)

 Ontological resources 



Mostly collections of relations among biomedical entities Sometimes also collect names

In practice: “Terminologies and Ontologies” Lister Hill National Center for Biomedical Communications

6

Analytical framework for quality assurance research

Analytical framework for QA research  Special

issue of JBI on “Auditing terminologies”  Zhu et al. JBI 2009 review article  Analytical framework   

What is analyzed Which source of knowledge Which method

Zhu, X., J.W. Fan, D.M. Baorto, C. Weng, and J.J. Cimino, A review of auditing methods applied to the content of controlled biomedical terminologies. J Biomed Inform, 2009. 42(3): p. 413-25.

Lister Hill National Center for Biomedical Communications

8

What is analyzed  Term/concept   

Coverage (missing terms/concepts) Wrong synonymy relation Redundant concepts

 Relation  

Missing relations Inaccurate relations

 Categorization 

Wrong categorization Lister Hill National Center for Biomedical Communications

9

Which source of knowledge  Intrinsic –   

the terminology itself

Terms/Concepts Relations Categorization

 Extrinsic – 

Corpus  



external resources

Text corpus – identify terms in text Annotation corpus – identify relations from co-occurring terms

Mapping

Lister Hill National Center for Biomedical Communications

10

Which method Main categories  Lexical 

Properties of the term

 Structural 

Properties of the organizational structure (relations)

 Semantic 

Semantic properties of the concept (semantic type)

Ischemic enteritis Acute Ischemic enteritis Chronic Ischemic enteritis

Modifier

Head

 Statistical 

Associations among entities Lister Hill National Center for Biomedical Communications

11

Which method Main categories  Lexical 

Properties of the term

 Structural 

Properties of the organizational structure (relations)

 Semantic 

Semantic properties of the concept (semantic type)

 Statistical 

Disease

Associations among entities

Endocrine system diseases Adrenal gland diseases Adrenal gland hypofunction

Adrenal cortex diseases

Adrenal cortical hypofunction Addison’s Disease

Lister Hill National Center for Biomedical Communications

12

Which method Main categories  Lexical 

Properties of the term

 Structural 

Properties of the organizational structure (relations)

 Semantic 

Semantic properties of the concept (semantic type)

Semantic Network Body Part, Organ, or Organ Component

location of Disease or Syndrome

Adrenal Cortex

Adrenal Cortical hypofunction

location of Metathesaurus

 Statistical 

Associations among entities Lister Hill National Center for Biomedical Communications

13

Which method Main categories  Lexical 

Properties of the term

 Structural 

Properties of the organizational structure (relations)

gene1 – term1 gene1 – term2 gene2 – term1 gene2 – term2 gene3 – term1 gene3 – term3

 Semantic 

Semantic properties of the concept (semantic type)

 Statistical 

term1

term3

term2

Associations among entities Lister Hill National Center for Biomedical Communications

14

Which method Additional methods  Compliance

with ontological

principles 

Operational definitions

 Comparative 

Comparisons between ontologies (mapping)

“Each concept, except for the root, must have (at least) one parent concept”

 Transformative 

Representation formalism

 Use

in an application

Lister Hill National Center for Biomedical Communications

15

Which method Additional methods  Compliance

with ontological

principles 

Operational definitions

MedDRA

SNOMED CT

 Comparative 

Comparisons between ontologies (mapping)

 Transformative 

Representation formalism

 Use

in an application

Lister Hill National Center for Biomedical Communications

16

Which method Additional methods  Compliance

with ontological

principles 

Operational definitions

 Comparative 

Comparisons between ontologies (mapping)

 Transformative 

Representation formalism

 Use

in an application

Lister Hill National Center for Biomedical Communications

17

Which method Additional methods  Compliance

with ontological

principles 

MAOUSSC: Using UMLS for the description of medical procedures

Operational definitions

 Comparative 

Comparisons between ontologies (mapping)

 Transformative 

Representation formalism

 Use

in an application

Lister Hill National Center for Biomedical Communications

18

Overview of 36 research studies

Overview  1998-2010  36  

research studies on QA 18 as primary focus 18 as an application

 13

different terminologies and ontologies  Using a wide range of methods  Presented

through the analytic framework

Lister Hill National Center for Biomedical Communications

20

Lister Hill National Center for Biomedical Communications

21

Year of publication 7 6 5 4 3

QA

2 1 0

Lister Hill National Center for Biomedical Communications

22

Year of publication 25 20 15 total 10

QA

5

28% 0

Lister Hill National Center for Biomedical Communications

23

Terminology/Ontology audited FMA GALEN Gene Ontology ICD-10 ICD-O LOINC MedDRA NCI Thesaurus RxNorm SNOMED CT UMLS Meta UMLS SemNet WordNet 0

5

10

15

Lister Hill National Center for Biomedical Communications

20 24

What is audited terms/concepts; relations;… terms/concepts; relations terms/concepts sem_types; relations relationships relations; categorization relations 0

2

4

6

8

Lister Hill National Center for Biomedical Communications

10

12 25

Knowledge sources used text corpus terms/concepts; relations;… terms/concepts; relations terms/concepts sem_types; relations relations; concept properties relations; categorization relations mapping; relations corpus of annotations co-occurences 0

2

4

6

8

Lister Hill National Center for Biomedical Communications

10 26

12

Methods used for auditing structural; semantic statistical semantic ontological principles lexical; statistical comparative application 0

1

2

3

4

5

Lister Hill National Center for Biomedical Communications

6 27

7

Four examples of quality assurance research studies

Four studies  Identifying polysemous

concepts in the UMLS  Identifying errors in RxNorm  Non-lexical approaches to identifying relations in the Gene Ontology  Lexical approaches to assessing the consistency of relations in SNOMED

Lister Hill National Center for Biomedical Communications

29

Identifying polysemous concepts in the UMLS Erdogan, H., E. Erdem, and O. Bodenreider, Exploiting UMLS semantics for quality assurance purposes. Medinfo, 2010 (in press).

Motivation  One 

error discovered in the UMLS

Inconsistent semantic group for the parent concept compared to the child concept

 Systematic

investigation

 Semantic

properties of the concepts at both ends of a hierarchical relation

Lister Hill National Center for Biomedical Communications

31

Methods  Check 

Semantic groups

 Pairs 

the semantic consistency

of hierarchically related concepts

2M concepts in the UMLS

 Answer

Set Programming

Lister Hill National Center for Biomedical Communications

32

Results  81,512

inconsistent concepts  Most inconsistencies are not indicative of any errors 

Navigation vs. inference

 Small 



number of “real”errors

Polysemous concepts (2 meanings in the same concept) = false synonymy Lexically-suggested, but not caught at editing time

 Pattern

for wrong synonymy

Lister Hill National Center for Biomedical Communications

33

Example Anatomy

Drug

Membranous layer

Pill

Capsule

Capsule of adrenal gland

Oral capsule

Lister Hill National Center for Biomedical Communications

34

Identifying errors in RxNorm Bodenreider, O. and L.B. Peters, A graph-based approach to auditing RxNorm. Journal of Biomedical Informatics, 2009. 42(3): p. 558-570.

Motivation  Large

terminology  Relies heavily on human editors  High quality  Systematic

evaluation  Exploiting the graph structure

Lister Hill National Center for Biomedical Communications

36

tradename_of

ingredient_of

has_ingredient

has_ingredient

has_ingredient

has_tradename

Brand name (BN)

ingredient_of

ingredient_of

Ingredient (IN)

tradename_of

consists_of

consists_of constitutes

tradename_of

consists_of

Clinical drug (SCD)

has_tradename

Branded drug component (SBDC)

constitutes

constitutes

Clinical drug component (SCDC)

Branded drug (SBD)

inverse_isa

isa

has_ingredient

ingredient_of

has_tradename

Branded drug form (SBDF)

has_ingredient

tradename_of

inverse_isa

Clinical drug form (SCDF)

isa

ingredient_of

has_tradename

Methods  Normalize

multi-ingredient drugs  Define “meaningful” paths between 2 nodes Cetirizine has_ingredient consists_of

constitutes

ingredient_of

consists_of constitutes

tradename_of

Branded drug component (SBDC)

constitutes

Clinical drug (SCD)

tradename_of has_tradename

consists_of

Branded drug (SBD)

Zyrtec 10 MG Oral Tablet

inverse_isa

isa

ingredient_of

Lister Hill National Center for Biomedical Communications

Branded drug form (SBDF)

38

ingredient_of

tradename_of has_tradename

has_ingredient

Clinical drug form (SCDF)

isa

has_ingredient

has_tradename

inverse_isa

Alternate (meaningful) paths are expected to be functionally equivalent

Zyrtec

Brand name (BN)

has_ingredient

Clinical drug component (SCDC)



tradename_of has_tradename

ingredient_of

ingredient_of

Ingredient (IN)

has_ingredient

meaningful paths Cetirizine  Compare alternate paths Cetirizine 10 MG  Instantiate all

Zyrtec

Results  348

inconsistencies identified (April 2008)  Reported to the RxNorm team  215 (62%) fixed (January 2009) Sodium chloride

Sochlor tradename_of

has_ingredient consists_of

constitutes

ingredient_of

has_ingredient

tradename_of has_tradename consists_of constitutes

tradename_of

Branded drug component (SBDC)

constitutes

Clinical drug (SCD)

Brand name (BN)

consists_of

Branded drug (SBD)

Sochlor 0.854 MEQ/ML Ophthalmic Solution

inverse_isa

isa

ingredient_of

Lister Hill National Center for Biomedical Communications

Branded drug form (SBDF)

39

ingredient_of

tradename_of has_tradename

has_ingredient

Clinical drug form (SCDF)

isa

has_ingredient

has_tradename

inverse_isa



missing link Sochlor → Sodium chloride Brand name without a direct relation with an ingredient

Sodium Chloride 0.854 MEQ/ML Clinical drug component (SCDC)

Sochlor ingredient_of



(fixed)

has_tradename

has_ingredient

 Example

ingredient_of

Ingredient (IN)

Non-lexical approaches to identifying relations in the Gene Ontology Bodenreider, O., M. Aubry, and A. Burgun, Non-lexical approaches to identifying associative relations in the Gene Ontology, in Pacific Symposium on Biocomputing 2005, R.B. Altman, et al., Editors. 2005, World Scientific. p. 91-102.

Motivation  Gene 

 

ontology (GO)

Widely used for the functional annotation of gene products in many model organisms 3 separate hierarchies No relations across hierarchies Molecular functions

Cellular components

Biological processes

 Missing relations  Missing annotations

Lister Hill National Center for Biomedical Communications

41

Methods  Lexical

methods had been exploited already  Non-lexical methods   

Vector space model (information retrieval model) Co-occurrence of annotations Association rule mining

Lister Hill National Center for Biomedical Communications

42



Similarity in the vector space model GO terms …

tn g1 g2

g1 g2 … gn

Annotation database

… gn

t1 t2

GO terms

Genes

t1 t2

Genes



tn

Lister Hill National Center for Biomedical Communications

43



Similarity in the vector space model Genes g1 g2



tn

ti

→ →

Sim(ti,tj) = ti · tj

tj

GO terms

Similarity matrix

t1 t2

GO terms

GO terms

t1 t2

… gn



tn

t1 t2 …

Lister Hill National Center for Biomedical Communications tn

44

Results Vector Space Model VSM LEX Mol Fnc-Cell Comp 499 917 Mol Fnc-Biol Proc 3057 2523 Cell Comp-Biol Proc 760 2053 Total 4316 5493 Mol Fnc: Biol Proc:

ice binding response to freezing

Non-lex Lexical 7665 5493 230

Lister Hill National Center for Biomedical Communications

45

Lexical approaches to assessing the consistency of relations in SNOMED Bodenreider, O., A. Burgun, and T.C. Rindflesch, Assessing the consistency of a biomedical terminology through lexical knowledge. International Journal of Medical Informatics, 2002. 67(1-3): p. 85-95.

Motivation  Detect

potential inconsistencies in SNOMED  Based on adjectival modification 

Frequent construct in medical terms Ischemic enteritis Chronic Ischemic enteritis

Chronic

Acute

Ischemic enteritis Chronic Ischemic enteritis

Acute Ischemic enteritis

Lister Hill National Center for Biomedical Communications

47

Methods  Identifying adjectival

modifiers  Adjectives and their contexts  Co-occurrence of modifiers  Generating new modified terms  Relationship among terms associated in a pair (acute, chronic) (unilateral, bilateral) (primary, secondary) (acquired, congenital) Lister Hill National Center for Biomedical Communications

48

Results Acquired/congenital (SNOMED) keratodermia Context term present (31%) X 4% Acquired only (10%)

1% Acquired X

Congenital X

Congenital only (84%)

Both present (5%) acquired keratodermia

congenital keratodermia

Lister Hill National Center for Biomedical Communications

49

Summary  Investigated quality  

assurance in terminologies

Multiple approaches Multiple terminologies

 Identified

a limited number of errors that had defeated the quality assurance mechanisms in place in terminology development systems  Reported to the developers 

Fixed in some cases (FMA, RxNorm)

 Shared

our methods with the scientific community

Lister Hill National Center for Biomedical Communications

50

Future work  Develop

the use of Semantic Web technologies (RDF / SPARQL and OWL) to support quality assurance  Quality assurance “in action” 



Investigate terminologies of clinical usefulness, (e.g., NDF-RT – clinical information about drugs) Evaluate its capacity to support clinical decision

 Improve

the quality of SNOMED CT through our involvement with the IHTSDO Lister Hill National Center for Biomedical Communications

51

Acknowledgments 

NLM      





U. Utah 



Barry Smith Lowell Vizenor Joyce Mitchell

       











Erik van Mulligen Lazlo van den Hoek

PR China 



Stefan Schulz Elena Beisswanger

Netherlands 

Duo Wei

Fleur Mougin Anita Burgun Anand Kumar Christine Golbreich Marc Aubry Genieve Botti Marius Fieschi Pierre Le Beux François Kohler

Germany 

NJIT 

France 

Alexa McCray

U. Buffalo 



Lee Peters Kin Wah Fung Lan Aronson Bill Hole Suresh Srinivasan Tom Rindflesch

Harvard 





Songmao Zhang

Turkey  

Halit Erdogan Esra Erdem

Lister Hill National Center for Biomedical Communications

52

Medical Ontology Research Contact: [email protected] Web: mor.nlm.nih.gov Olivier Bodenreider, MD, PhD Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA