Implementing Meta-1: The First Version of the UMLS Metathesaurus*

Mark Tuttle (a, b), David Sherertz (a), Mark Erlbaum (a), Nels Olson (a), Stuart Nelson (c, a)

(a) UMLS Group, Lexical Technology, Inc., Alameda, CA
(b) Section on Medical Information Science, University of California, San Francisco
(c) School of Medicine, State University of New York, Stony Brook

Abstract: The Unified Medical Language System (UMLS) is being designed to provide uniform access to computer-based resources in biomedicine. For the foreseeable future, the foundation of the UMLS will be a metathesaurus of concepts, synthesized from existing sources, including MeSH, SNOMED, ICD-9-CM, CPT-4, DSM-III and other biomedical nomenclatures and classification systems. In Meta-1, the first version of the Metathesaurus, the synthesis is being implemented using a three-part methodology: 1) concept names (terms) and intra-source relationships, such as synonymy, have been extracted from each source, and converted to a homogeneous representation; 2) inter-source lexical matches have been used to combine terms from different sources into Metathesaurus entries; and 3) some 30,000 of these entries, those containing MeSH terms and a selected sample of terms from other domains, will be reviewed by humans, enhanced, and modified, as appropriate. This methodology must eventually support incremental development and an audit trail, and it must preserve relationships added during human review. The 30,000 Meta-1 entries will contain in excess of 60,000 biomedical terms, and these terms will participate in more than 100,000 thesaurus relationships. These "normative" relationships will be supplemented by "empirical" relationships computed from certain UMLS resources. The first of the empirical relationships will be counts of the occurrence and co-occurrence of Meta-1 concepts in MEDLINE.

Users of the UMLS need not "understand" Meta-1, or its successors, to benefit from it. Users of future versions of Grateful Med®,6 for instance, may notice only that a program they are already familiar with has increased power and functionality. But for those who wish to understand how to use Meta-1 explicitly, there will be at least two approaches. First, Meta-1 may be viewed as a large sample of Metathesaurus entries (see Figure 2), the contents of which will be best appreciated by viewing Meta-1 through specially designed browsers.7 Second, Meta-1 can be thought of in terms of the methodology used to build it.

2. Overview - Methodology & Content

The easiest way to understand both the strengths and weaknesses of Meta-1 is to understand how it is being constructed. In other words, the implementation methodology determines its content. This methodology reflects, in turn, the objectives of the UMLS as a whole. The UMLS objectives derive from the long experience of the NLM with the development, support, maintenance, enhancement and distribution of machine-readable biomedical resources, and from its view of the particular importance of such efforts in the future. Thus, we begin with a review of the UMLS objectives, the Metathesaurus objectives, and the Meta-1 objectives, and conclude with how Meta-1 will be implemented. The three sets of objectives are discussed further in a parallel submission.8

Again, the degree to which the implementation methodology will determine the content of Meta-1 is considerable. Except for "facts" added during human review, all of Meta-1 is implicit in the sources used to build it and the automatic procedures used to match the terms within it. Put more operationally, Meta-1 will be exactly the result of applying set-based automatic procedures to sources, and human review to each of the resulting entries. Figure 3 displays a summary of these operations. Each step requires human judgment, but, in all but step 7, the judgments are encoded in procedures which operate on sets of terms or relationships, and often on entire sources.

The notion of a formal context is developed in Tuttle, et al.,9 the idea being to distinguish contextual information, explicitly represented in a given source's schema, from the potentially much larger universe of contextual information which might be inferred from that source by a human.

The Meta-1 Template defines the set of relationships which are explicitly represented in a given Meta-1 entry, e.g. the indication of which term is the preferred term, what sources contain that preferred term, which other terms are lexical variants10 of the preferred term, which other terms are synonyms, or related terms, etc.

Precedence is used to favor one source over another, when, for instance, more than one source contains a given concept, and one must arbitrarily assume the preferred status.

The philosophical mind unites where the pedant parts, he is convinced that in the provinces of both the intellect and the senses ALL THINGS ARE LINKED TOGETHER, and in his desire for synthesis he cannot content himself with fragments.
- Johann Friedrich von Schiller (1789)1

Those who work with language know that there is no such thing as a true "synonym."2

1. Background - Approaching Meta-1

As part of the Unified Medical Language System (UMLS) initiative,3 the National Library of Medicine (NLM) is undertaking the development of a thesaurus of biomedicine.4 To be called the Metathesaurus, reflecting its relationship to existing biomedical terminologies, it will be the principal representation of "semantic locality"5 in early versions of the UMLS. That is, it will be a place where both humans and computer programs can find different ways of saying the same thing, and ways of saying related things, in the biomedical domain. (See Figure 1.)

A first version of the Metathesaurus, called Meta-1, is scheduled for completion in May of 1990. While the objectives for Meta-1 are modest relative to those for the Metathesaurus as a whole, it will nonetheless possess formidable size and complexity. It is unlikely, for instance, that it will ever appear in printed (paper) form. Further, it may be subsumed within a year or two by Meta-2, which will be larger and more complicated.

*Supported by NLM Contract N01-LM-8-3512.

0195-4210/89/0000/0483$01.00 © 1989 SCAMC, Inc.


Figure 1: "What would a user see?" A conceptual model of the UMLS. Shown above is one view of what a user might "see" when using the UMLS. The UMLS would appear to have from two to four layers. During a routine interaction a user might see only the top and bottom layers; but during more complex interactions, all layers would be visible. The top layer would be an interface, something which collects input and displays output, and manages the interaction. The next layer below this interface would be the Metathesaurus. If a user and the interface agree on a set of biomedical terms relevant to some inquiry, then the Metathesaurus would not be "visible." Otherwise, interactions with the Metathesaurus would attempt to produce terms recognized by the UMLS and agreeable to the user. However they are selected, these terms can then be used to access the Database Abstracts. Again, during execution of a routine inquiry (one which produced neither too much, nor too little information), the user would not be aware of the abstracts, but would see the result of retrievals against the bottom layer directly. Inquiries which proved either too general or too specific could be modified appropriately by using information available in the abstracts. Initially, these abstracts will be nothing more than counts of occurrences and co-occurrences of Metathesaurus concepts in the various UMLS target sources (DBs). The counts would be an estimate of the amount of information relevant to a given concept in a given source. Eventually, the abstracts could include more of what might be regarded as a semantic model of the information in each source, as anchored in the concepts in the Metathesaurus.11,12

Figure 2: Abstracts of two Meta-1 entries, before human review. Shown below is some of the information to be contained in Meta-1 regarding the concept Peutz-Jeghers Syndrome.13

MC: [1] Peutz-Jeghers Syndrome
SO: MSH/MH/D010580
TY: Disease or Syndrome
DE: Multiple pigmented (melanin) macules of the skin and mouth mucosa and multiple polyposis of the small intestine.
LV: Peutz Jegher's Syndrome    MSH/ET/D010580
    Peutz-Jegher's Syndrome    MSH/ET/D010580
    Peutz-Jeghers' syndrome    SNM/SY/D-5432
    Syndrome; peutz-jeghers    ICD/IT/759.6
SY: Periorificial lentiginosis syndrome    SNM/PT/D-5432
CX: MeSH - Diseases
      Neonatal Diseases and Abnormalities / Hereditary Diseases / Neoplastic Syndromes, Hereditary / <Peutz-Jeghers Syndrome>
      Neoplasms / Neoplastic Syndromes, Hereditary / <Peutz-Jeghers Syndrome>
      Neoplasms / Polyps / Intestinal Polyps
      Digestive System Diseases / Gastrointestinal Diseases / Intestinal Diseases / Intestinal Polyps / <Peutz-Jeghers Syndrome>
    SNOMED - Disease Axis
      Multiple System Malformation Syndromes and Chromosomal Diseases / Hamartomatous Diseases and Syndromes / <Periorificial lentiginosis syndrome>
RT: OTHER CONGENITAL HAMARTOSES
AI: CurMEDLINE 9, *5
    MIM
QB: 2 complications
    2 familial & genetic
    2 pathology
    2 surgery
    1 diagnosis
OW: 14 Diseases
       3 Neoplasms, Multiple Primary
       2 Adenocarcinoma
       2 Pancreatic Neoplasms
       1 Genital Neoplasms, Female
       1 Hemorrhage, Gastrointestinal
       1 Intestinal Neoplasms
       1 Polyposis Syndrome, Familial
       1 Psoriasis
       1 Sertoli Cell Tumor
       1 Testicular Neoplasms
    2 Analytical, Diagnostic and Therapeutic Technics and Equipment
       1 Duodenoscopy
       1 Endoscopy
    1 Anatomical Terms
       1 Skin
    1 Biological Sciences
       1 Chromosome Aberrations
BF: 1986  45, *38
    1983  66, *47
    1980  58, *44
    1977  78, *56
    1972  141, *74
    1966  185, *129
NO: 65; was see under POLYPI (now POLYPS) 1963-64

MC: [2] OTHER CONGENITAL HAMARTOSES
SO: ICD/PT/759.6
CX: ICD - Diseases and Injuries / Congenital Anomalies / Congenital anomalies
RT: Hamartoma
    Peutz-Jeghers Syndrome
    Sturge-Weber Syndrome

The [1] preceding the "Main Concept" (MC) Peutz-Jeghers Syndrome indicates that the latter is a "1st Class" entry. In Meta-1, all MeSH Main Headings, and about 1,000 other terms, are the names of distinct 1st Class concepts. Conversely, the [2] preceding the (MC) OTHER CONGENITAL HAMARTOSES indicates that it is the name of a "2nd Class" concept.14 All 2nd Class concepts are in Meta-1 because they are related to one or more 1st Class concept(s) (in some authoritative source). The distinction supports the notion of a horizon in Meta-1, as related terms of related terms are not present in Meta-1 unless they are there for other reasons.

The SO field indicates that the concept name comes from MeSH.

The TY field contains the semantic type of the concept. An important objective of human review is the correction and enhancement of the automatically assigned semantic types, although neither correction nor enhancement (of the type) is required in this case.

The LV (Lexical Variant) field contains two MeSH entry terms (ET), a SNOMED synonym (SY), and an ICD index term (IT). The SNOMED synonym pulls in the SNOMED preferred term (PT) in the SY field.

The CX field contains the source contexts for all the terms just enumerated. In this case there are four MeSH contexts and one SNOMED context. The ICD index (IT) term pulls in its ICD parent, which is displayed in the RT field. (This relationship can be inferred from the 2nd Class entry.)

AI stands for "Appears In", a field which acts as a surrogate for the database abstracts still under development. The field reveals the number of citations in the current MEDLINE file which are indexed by the term. It also shows that the term is listed in MIM (Mendelian Inheritance in Man, a database of descriptions of inherited diseases). The QB (Qualified By) field shows how MeSH subheadings modified the indexing reported in the AI field. The contents of the OW (Occurs With) field provide links to other 1st Class concepts related by the MEDLINE citations, i.e. MEDLINE indexers deemed that both the MC and each term in the OW field were (together) important in some article. The OW entries are organized by their MeSH categories, here used as surrogates for the yet-to-be-reviewed semantic types. The BF field shows the citation count for each of the MEDLINE back files. The NO field shows an earlier classification of the MC.

The remainder of the example shows the (abstracted) 2nd Class entry referenced by the RT field of the 1st Class entry. The three RTs (Related Terms) shown in the 2nd Class entry are each 1st Class terms which have the 2nd Class MC as an RT.
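The field layout described for Figure 2 can be sketched as a small record type. This is only an illustration: the class names, attribute names, and the choice of which fields to include are assumptions, not the Meta-1 Template itself, and the values are taken from the Peutz-Jeghers Syndrome entry in Figure 2.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SourcedTerm:
    """A term string tagged with its origin, e.g. MSH/ET/D010580."""
    text: str
    source: str      # source vocabulary: "MSH", "SNM", "ICD"
    term_type: str   # type within that source: "MH", "ET", "SY", "PT", "IT"
    code: str        # the source's own identifier for the term


@dataclass
class Entry:
    """Hypothetical in-memory form of one entry (a subset of the fields)."""
    main_concept: SourcedTerm                                           # MC, SO
    concept_class: int                                                  # 1 or 2
    semantic_type: Optional[str] = None                                 # TY
    lexical_variants: List[SourcedTerm] = field(default_factory=list)   # LV
    synonyms: List[SourcedTerm] = field(default_factory=list)           # SY
    related_terms: List[str] = field(default_factory=list)              # RT

    def sources(self) -> List[str]:
        """Every source vocabulary that contributes a term to this entry."""
        terms = [self.main_concept, *self.lexical_variants, *self.synonyms]
        return sorted({t.source for t in terms})


pjs = Entry(
    main_concept=SourcedTerm("Peutz-Jeghers Syndrome", "MSH", "MH", "D010580"),
    concept_class=1,
    semantic_type="Disease or Syndrome",
    lexical_variants=[
        SourcedTerm("Peutz Jegher's Syndrome", "MSH", "ET", "D010580"),
        SourcedTerm("Peutz-Jegher's Syndrome", "MSH", "ET", "D010580"),
        SourcedTerm("Peutz-Jeghers' syndrome", "SNM", "SY", "D-5432"),
        SourcedTerm("Syndrome; peutz-jeghers", "ICD", "IT", "759.6"),
    ],
    synonyms=[
        SourcedTerm("Periorificial lentiginosis syndrome", "SNM", "PT", "D-5432"),
    ],
    related_terms=["OTHER CONGENITAL HAMARTOSES"],
)
```

Keeping the source, term type, and code on every term is what lets a reader of the entry trace each string back to the vocabulary that supplied it.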

Figure 3: A sequential view of the Meta-1 implementation plan; all steps except 7 are automated.

1) ASSEMBLE - Express, in a consistent character set, all relevant information from each source.
2) ... source "formal context," in a uniform syntax.
3) PLACE - Using only intra-source information, put each term in the appropriate slot in the Meta-1 Template, e.g. distinguish preferred forms, synonyms, and related terms.
4) MATCH - Find all intra- and inter-source pairs of terms which "match" lexically. Depending on the type of match, some will become Lexical Variants (LVs), some will become suggested LVs, while others will become candidates for synonymy.
5) MERGE - Combine entries containing matching terms, e.g. move lexical variants to the "Lexical Variant" slot, and merge different entries sharing preferred terms, lexical variants, or synonyms, using precedence to determine the resulting entry's preferred form.
6) SELECT - Select the entries containing MeSH terms, and entries containing selected additional terms, to yield 30,000 Meta-1 entries.
7) REVIEW - Review and complete the 30,000 entries manually.
8) AUGMENT - Add occurrence and co-occurrence information to each term for which it is available.15

Each of Steps 1-8 requires important judgments, but only in step 7 are individual Meta-1 records affected by term-at-a-time judgments. All other judgments are encoded in automatic procedures which operate on a set-at-a-time basis, often on entire sources, or on all of Meta-1.
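The MATCH and MERGE steps can be illustrated with a minimal sketch. The normalization rule used here (fold case, drop apostrophes, ignore punctuation and word order) and the precedence order MSH, SNM, ICD are assumptions made for illustration; the matching rules actually used for Meta-1 are described in the companion paper on lexical mapping,10 not here.

```python
import re
from collections import defaultdict

# Hypothetical precedence: earlier sources win the preferred form.
PRECEDENCE = ["MSH", "SNM", "ICD"]


def lexical_key(term):
    """Reduce a term to a key that ignores case, punctuation and word order."""
    text = term.lower().replace("'", "")
    tokens = re.findall(r"[a-z0-9]+", text)
    return " ".join(sorted(tokens))


def match_and_merge(terms):
    """Group (text, source) pairs that share a lexical key into entries.

    Within each group the term from the highest-precedence source becomes
    the preferred form; the rest are recorded as lexical variants.
    Sources missing from PRECEDENCE raise ValueError in this sketch.
    """
    groups = defaultdict(list)
    for text, source in terms:
        groups[lexical_key(text)].append((text, source))
    entries = []
    for members in groups.values():
        ranked = sorted(members, key=lambda m: PRECEDENCE.index(m[1]))
        entries.append({"preferred": ranked[0], "lexical_variants": ranked[1:]})
    return entries


terms = [
    ("Syndrome; peutz-jeghers", "ICD"),
    ("Peutz-Jeghers' syndrome", "SNM"),
    ("Peutz-Jeghers Syndrome", "MSH"),
    ("Peutz Jegher's Syndrome", "MSH"),
    ("Periorificial lentiginosis syndrome", "SNM"),
]
entries = match_and_merge(terms)
```

Note that the procedure operates on the whole set of terms at once, in the set-at-a-time spirit of the automated steps; no individual term is inspected by hand.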

Some of the MeSH terms have definition (DE) fields.


3. Why Develop a UMLS?

A major part of the cognitive cost of retrieving information stems from problems with biomedical terminology.16 Even the most skillful user must often try to "guess" what terms a given interface expects in order to retrieve information relevant to the user's inquiry. Therefore, while the objectives of the UMLS initiative are diverse, a central theme is nomenclature transparency. Users able to describe a subject in some recognized terminology should be able to retrieve relevant materials, however the latter are classed.17 The UMLS will attempt to achieve uniformity of access via a unification of terminology and certain retrieval functions. The unification will be cumulative and gradual, and it will preserve the division of labor between attempts to unify knowledge and its maintenance by domain experts. The unification may be regarded as descriptive and evolutionary, rather than radical and revolutionary.

4. Why Build a Metathesaurus?

Given that one wishes to build a Unified Medical Language System, it does not necessarily follow that one needs to build a metathesaurus. The requisite information, e.g. links between terms, could be represented in a variety of forms. Seven reasons for representing the linkages in the form of a metathesaurus are listed below.

1) It will be useful in its own right.
2) It will define what the UMLS can "know."
3) Users will see a larger "target."
4) Developers will see a specified logical interface.
5) Applications will see a "closed world" of terminology.
6) It will be machine-processible & human-maintainable.
7) It will foster the transfer & additivity of information.

In summary, just as the Metathesaurus creates a closed world of terminology for the UMLS, it also creates a closed world of explicitly defined set-based relationships within and among the terms.

5. Three Models

5.1. A User Model

At its simplest level, the Metathesaurus is a corpus of information about biomedical concepts. At its next level of detail, it is a corpus of concepts from standard sources, arranged in a standard structure (a thesaurus). It promises homogeneity at the lexical and syntactic levels, and linking at the level of semantics.

5.2. A System Model

Figure 1 can be thought of as a very high level system model of the UMLS. Specifically, it shows one view of the first level of sub-modules of the UMLS and how they inter-connect. A system model of the Metathesaurus would display only terms, and the relationships among them. To facilitate maintenance, this model will be normalized. In a database, normalization is the process of removing redundancy.18 Thus, a normalized database is one in which each "fact" is represented exactly once.19 While users tend to dislike such representations, as the redundancy which is removed is exactly that which provides the contextual background for understanding what is being viewed,20 it is the normalization which permits consistent maintenance of the "facts" as new "facts" are acquired.

5.3. An Implementation Model

The implementation model proposed is quite simple, in principle. Certain facts are inherited from authenticated sources. These can be thought of as axioms, that is, they are outside the purview of the UMLS. Other facts are created via automatic procedures, some of which impose the Metathesaurus semantics (placing terms in slots), and others of which create facts via matching. Still more facts are created during human review, some of which may "undo" source facts, or facts created during matching. One advantage of this model is that, eventually, the process of implementing the Metathesaurus becomes indistinguishable from the process of updating it.21
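The normalized system model of Section 5.2 and the "undo" of Section 5.3 can be made concrete with a small sketch. It assumes a "fact" is a (subject, relation, object) triple, following the definition quoted in note 19; the class name `FactStore`, its methods, and the relation labels are hypothetical, not part of any UMLS software.

```python
from collections import defaultdict


class FactStore:
    """Hypothetical normalized store: each fact is held exactly once."""

    def __init__(self):
        self._facts = set()

    def add(self, subject, relation, obj):
        """Record a fact; return True only if it was not already present."""
        fact = (subject, relation, obj)
        is_new = fact not in self._facts
        self._facts.add(fact)
        return is_new

    def retract(self, subject, relation, obj):
        """Remove a fact, e.g. when human review "undoes" a machine match."""
        self._facts.discard((subject, relation, obj))

    def __len__(self):
        return len(self._facts)

    def view(self, subject):
        """Reassemble a redundant, human-readable entry for one subject."""
        entry = defaultdict(list)
        for s, relation, obj in sorted(self._facts):
            if s == subject:
                entry[relation].append(obj)
        return dict(entry)


store = FactStore()
first = store.add("Peutz-Jeghers Syndrome", "SY", "Periorificial lentiginosis syndrome")
again = store.add("Peutz-Jeghers Syndrome", "SY", "Periorificial lentiginosis syndrome")
store.add("Peutz-Jeghers Syndrome", "RT", "OTHER CONGENITAL HAMARTOSES")
store.add("OTHER CONGENITAL HAMARTOSES", "RT", "Peutz-Jeghers Syndrome")
entry = store.view("Peutz-Jeghers Syndrome")
```

The store itself stays free of redundancy, which is what makes updates consistent, while `view` regenerates on demand the contextual, entry-shaped presentation that human readers prefer.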

6. How Will Meta-1 be Implemented?

6.1. Develop Source-specific Inversion Procedures

(See Step 2 in Figure 3.) Source inversion procedures convert Metathesaurus sources into a lexically and syntactically homogeneous form. The "inversion" refers to the extraction of the terms along with any relationships that source encodes for that term. In principle, this process should be straightforward; the schema of each source should represent the information necessary to carry out the inversion. In fact, biomedical sources are like characters in a Hemingway novel. Schemas are easy to create, but hard to live by. Generally, two things can happen. The sources can exhibit what biologists call emergent complexity, and things become much more complex than expected, or they can violate their own schemas with undocumented "exceptions." Typically, the number of such exceptions increases over time. Usually, each source does what it is supposed to do well, e.g. in MeSH all Main Headings are distinct, just as the MeSH schema predicts should be the case; but, in almost every other respect, terms in various sources can be found which violate their schemas. In none of these cases is the chief function of the sources impeded. But each violation can wreak havoc with a procedure attempting to manipulate all of some type of term or relationship in a given source.22

Differences in lexical style also pose problems. Typically, the less hierarchical a source is the more its terms are "precoordinated," i.e. the more they are combinations of more primitive concepts, instead of being abstractions of those concepts.23 For example, CPT-4 code 54700 stands for the term Incision and drainage of epididymis, testis and/or scrotal space (eg, abscess or hematoma). Not surprisingly, the inversion procedures possess the greatest degree of source-specific lexical processing techniques.

At its most abstract level, the source inversion process is one of identifying, and labeling, the "slots" found in each source, and mapping the contents of those slots to Metathesaurus slots automatically. In practice, this is something which has to be done iteratively, another argument for merging the implementation and maintenance process, so as to accommodate the "discovery" of additional information in sources.
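The slot-mapping view of source inversion can be sketched as a table-driven procedure. Everything named here is an assumption for illustration: the `SLOT_MAP` table, the target slot names "preferred" and "synonym", and the record layout are hypothetical, and real inversion procedures would need far more source-specific handling.

```python
# Hypothetical mapping from each source's own term types to template slots.
SLOT_MAP = {
    "MSH": {"MH": "preferred", "ET": "synonym"},
    "SNM": {"PT": "preferred", "SY": "synonym"},
    "ICD": {"PT": "preferred", "IT": "synonym"},
}


def invert(source, records):
    """Convert (term_type, code, text) records into a uniform row format.

    Records whose term type is not in the source's slot map are treated as
    schema "exceptions": they are set aside for inspection rather than
    allowed to break a procedure that processes the whole source.
    """
    placed, exceptions = [], []
    source_slots = SLOT_MAP.get(source, {})
    for term_type, code, text in records:
        row = {
            "source": source,
            "type": term_type,
            "code": code,
            "text": " ".join(text.split()),  # collapse stray whitespace
        }
        slot = source_slots.get(term_type)
        if slot is None:
            exceptions.append(row)
        else:
            row["slot"] = slot
            placed.append(row)
    return placed, exceptions


snomed_records = [
    ("PT", "D-5432", "Periorificial  lentiginosis syndrome"),
    ("SY", "D-5432", "Peutz-Jeghers' syndrome"),
    ("XX", "D-5432", "undocumented record type"),
]
placed, exceptions = invert("SNM", snomed_records)
```

Collecting exceptions instead of failing supports the iterative style described above: each newly discovered slot becomes one more line in the mapping table, and the inversion is simply re-run.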

6.2. Pre-Computing Meta-1

One advantage of a normalized implementation of Meta-1 is that it makes it easier to identify natural "dimensions" and follow them independently. Two dimensions are source inversion and matching. Two measures, important-to-less-important and certain-to-less-certain, can be used to impose a gradient on this two-dimensional space. Given such a gradient, the process of building Meta-1 can be thought of as "hill-climbing."

Initially, work can proceed more or less independently on source inversion and matching, by focusing on things which are both important and certain. It is certain that terms need to be extracted and homogenized from each source, and that exact or nearly exact matches need to be found between them. Further, since, at this point, not much is known about the context of each term, interpreting the matching results in light of the Meta-1 semantics shouldn't be difficult. In other words, simple rules will indicate how terms should be re-arranged in Meta-1 slots in light of the match.

However, as additional relationships are extracted from the sources, and as matching is made more aggressive, things become much more complicated. A given term may participate in a large number of matches of varying "strength" with other terms of varying "importance." For instance, a "main heading" is more important than an "entry term," regardless of the source. Further, that term may participate in a number of relationships, some from that term's source and some from the portions of Meta-1 already computed. Resolving these conflicts will have to be done with reference to a general notion of "uphill" which combines both Meta-1 semantics and matching.24 At this point, progress along either the source dimension or the matching dimension is tightly coupled.

Eventually, feedback on the quality and number of matches becomes more important, as it will define the appropriate "horizon." For instance, matching two distinct MeSH main headings via one or more intermediate terms from other sources will probably be considered "over the horizon." In parallel with this process, the notion of an audit trail will be developed, and the methodology will become automated to a higher degree, probably using the UNIX™ program make to help decide when certain database queries need re-running. Ideally, this "make" facility will allow a "backup-and-restart," under certain circumstances, when, for instance, some matching or merging rule produced undesired results.
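The make-style decision about what needs re-running can be sketched without make itself. The target and prerequisite names below are hypothetical stand-ins for intermediate results of the build, and the sketch tracks only which inputs have changed; make proper compares file modification times.

```python
# Hypothetical build rules: each target lists what it is computed from.
# Targets must be declared in dependency order, upstream before downstream.
RULES = {
    "inverted_sources": ["raw_sources", "inversion_rules"],
    "matches": ["inverted_sources", "matching_rules"],
    "merged_entries": ["matches", "merge_rules"],
}


def stale_targets(changed):
    """Return the targets that must be recomputed, in the order to run them.

    A target is stale if any of its prerequisites changed or is itself
    stale, so a change propagates downstream but never upstream.
    """
    dirty = set(changed)
    stale = []
    for target, prerequisites in RULES.items():
        if dirty.intersection(prerequisites):
            stale.append(target)
            dirty.add(target)
    return stale


after_rule_fix = stale_targets({"matching_rules"})
after_new_tape = stale_targets({"raw_sources"})
```

Under these assumptions, correcting a matching rule that produced undesired results restarts the build at the matching step and leaves the more expensive source inversion untouched, which is the "backup-and-restart" behavior sought.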

7. Summary

Meta-1 will be implemented using a large repertoire of techniques and sources, few of which are, by themselves, "new." Rather, the challenge is to combine these techniques and sources in a manner which will support the objectives of the UMLS and satisfy the constraints imposed on the notion of a metathesaurus. At the root of these techniques is the creation and exploitation of lexical and syntactic uniformity, and the tolerance and representation of semantic heterogeneity. It is the former which permits straightforward computer manipulation, and the latter which permits the modeling of the natural world.

Acknowledgements: We would not be involved with this project were it not for the pioneering work of the late M.S. Blois, Ph.D., M.D., who first convinced us of the representational power of medical terminology. We thank Christopher Cherniak, Ph.D. for helpful suggestions regarding cognitive and natural complexity, William Cole, Ph.D. for many suggestions regarding metathesaurus metaphors, and Robert Abarbanel, M.D., Ph.D., for many thoughtful cross-examinations regarding the underlying metathesaurus model.

1 Reinhold, Walter, "Culture 1.0™: The HyperMedia Guide to Western Civilization," Cultural Resources, Inc., Scotch Plains, NJ, 1989.
2 Urdang, Laurence, "Introduction", The Synonym Finder, J.I. Rodale, ed., Rodale Press, Emmaus, Pa., 1978.
3 "Unified Medical Language System," National Library of Medicine News, 1986, 41(11):1-2, 10-11.
4 Humphreys, B.L., & Lindberg, D.A.B., "Building the Unified Medical Language System," submitted to SCAMC '89.
5 Tuttle, et al., "Toward a Biomedical Thesaurus: Building the Foundation of the UMLS", SCAMC '88, p. 191-5.
6 Grateful Med®, User's Guide, Version 4.0, Dept. of Health & Human Services, N.I.H., N.L.M., December, 1988.
7 Sherertz, et al., "A HyperCard Implementation of Meta-1", a demonstration submitted to SCAMC '89.
8 Humphreys & Lindberg, op. cit.
9 Loc. cit., SCAMC '88.
10 See Sherertz, et al., "Lexical Mapping in the UMLS Metathesaurus," submitted to SCAMC '89.
11 Current plans call for the notion of a Database Abstract to be subsumed with the Information Sources Map discussed in Humphreys & Lindberg. The latter map will contain the practical and syntactic information required for a human or a computer to utilize one of the DBs at the bottom of the diagram, as well as the semantic information about what it contained relevant to concepts in the Metathesaurus.
12 This diagram was prepared by William Cole, Ph.D., for a presentation at Apple Computer, 1/12/89.
13 The format employed is arbitrary. For humans the HyperCard version of the same information (cited above) is much more revealing, and for programs the same information can be represented in the form of a normalized database.
14 Alert medical coders may notice that both the 2nd and 3rd Editions of ICD-9-CM use, as the description of code 759.6, "Other hamartoses, ...", and not the "Other congenital hamartoses, ..." we found on the ICD tape. Thus, while the notion of authenticated nomenclatures and classification systems is critical to the metathesaurus, practical problems remain. These problems will be addressed by the relevant parties well before the release of Meta-1.
15 Also under consideration for addition to the AI field of Meta-1 are the occurrences and co-occurrences of the MC in selected patient databases.
16 The next largest cognitive burden is the content-independent idiosyncrasies of each interface. An hypothesis of the UMLS is that the availability of the Metathesaurus will allow the development of interfaces which are both simpler for novices and, at the same time, more powerful in the hands of experts.
17 Lindberg & Humphreys (88) observe that this problem may be more subtle than is implied here. For instance, a user attempting to retrieve information about a patient described by a list of relevant findings may have to convert those findings to one or more tentative diagnoses, the simple reason being that the medical literature is more often organized around diagnoses than around findings. Conceivably, the UMLS could suggest such conversions.
18 Date, C.J., An Introduction to Database Systems, Vol. I, 4th Ed., Addison-Wesley, 1987.
19 A simple definition of a "fact" is one preferred by the late M.S. Blois: "A fact is a relation between two things." Notions of "facts," "things," "relations," and "normalization" are difficult to formalize, but in this context they provide a useful way of thinking about the Metathesaurus.
20 Cole, W.G., "The Recontextualization of Data," unpublished.
21 Sperzel, D.W., Tuttle, M.S., "Updating the UMLS Metathesaurus: A Model," submitted to SCAMC '89.
22 As with the process of database normalization, it is hard to specify where a schema stops and knowledge begins. For example, some sequences of disease descriptions in Current Medical Information and Terminology (CMIT) have been found to be "circular". Is this a violation of CMIT's schema, or is it the fault of medical knowledge? See Nelson, S.J., et al., "Representing Medical Knowledge in Structured Text: Experience with Current Disease Descriptions," submitted to SCAMC '89.
23 See Sherertz, et al., (89) for some examples.
24 Note that this does NOT mean that terms must be linked to only one entry for a given type of link. The semantics inherited from the sources and judged valid by reviewers will be represented, even if it means that the relationships could be interpreted as "contradictory."