
Biomedical Database Inter-Connectivity: An Experiment Linking MIM, GENBANK, and META-1 via MEDLINE

W.D. Sperzel1, R.M. Abarbanel2, S.J. Nelson3, M.S. Erlbaum1, D.D. Sherertz1, M.S. Tuttle1, N.E. Olson1, L.F. Fuller1

1 Lexical Technology Incorporated, Alameda, CA
2 Boeing Computer Services, Seattle, WA
3 Medical College of Georgia, Augusta, GA

Abstract

The linkage of disparate biomedical databases is an important goal of the Unified Medical Language System (UMLS) Project. We conducted an experiment to investigate the feasibility of using UMLS resources to link databases in clinical genetics and molecular biology. References from MIM ("Mendelian Inheritance in Man") were lexically mapped to the equivalent citations in MEDLINE. The MeSH major subject headings by which the citations in a particular MIM entry had been indexed were used to develop a "genetic-disorder-centered view of the world" in Meta-1 (the first official version of the UMLS Metathesaurus). Our hypothesis was that these MeSH subject headings could provide access to a "semantic neighborhood" in Meta-1 that would be relevant to a particular genetic disorder. By browsing in this "semantic neighborhood," a user could select various combinations of terms with which to search MEDLINE through an interface between Meta-1 and Grateful Med. Such searches might retrieve citations that were more recent than those in MIM or that provided useful supplementary information. Since some MEDLINE records contain pointers to entries in GENBANK, information about genetic sequences related to a particular clinical genetic disorder could also be retrieved. This scenario was implemented for a small number of MIM entries, providing a concrete demonstration that linking disparate electronic databases in an important subdomain of biomedicine is relatively straightforward.

Introduction

McKusick's "Mendelian Inheritance in Man" (MIM) catalogs approximately five thousand autosomal dominant, autosomal recessive, and X-linked phenotypes [1]. Each catalog entry contains a narrative description of the phenotype and a list of carefully selected references. An electronic version of MIM, known as OMIM ("Online Mendelian Inheritance in Man"), is also available. We decided to investigate the hypothesis that (1) the references in OMIM could provide a rich source of links to the National Library of Medicine's MEDLINE database and (2) these links could be used to access Meta-1 concepts relevant to the genetic disorders in OMIM.

0195-4210/91/$5.00 © 1992 AMIA, Inc.

Methods

Although the references in OMIM are stored as ordinary text, rather than being divided into database fields like the references in MEDLINE, almost all of the OMIM references have a fairly standardized format. This meant that the OMIM references could be treated as formal text objects with a regular substructure. Accordingly, an attempt was made to map the OMIM references to MEDLINE citations using lexical techniques. A similar effort, using different techniques, was undertaken at the National Library of Medicine in 1986 [2].

For our experiment, a UNIX* script was used to extract author names, journal names, and years of publication from the OMIM references. The OMIM identification number for the entry in which the references appeared was also extracted, and a list of the journal names appearing in OMIM was produced. Programs were written in the C language to extract the MEDLINE citation number, author names, journal names, and years of publication from tapes of the MEDLINE current file (citations after the end of 1987) and two MEDLINE backfiles (citations after the end of 1982 and before the end of 1987). However, MEDLINE citations from journals that did not appear in OMIM were not extracted. The data extracted from OMIM and MEDLINE were loaded into an INGRES database (ASK/Ingres, Alameda, CA). Citations were matched by means of a relational database query that searched for instances where the primary author, date of publication, and journal name were identical in references from OMIM and MEDLINE.

Since the principal goal of this experiment was to provide a demonstration of biomedical database inter-connectivity, it was necessary to select a set of examples for further work. Another query was run to identify the phenotypes where more than five references from OMIM were matched to MEDLINE and where the number of references matched was more than 70% of the total number of references in the OMIM entry for that phenotype. From the results of this query, the following phenotypes were arbitrarily chosen to comprise the example set: Acyl CoA Dehydrogenase, Short-Chain, Deficiency of; Niemann-Pick Disease, Type C; Myelopathy, HTLV-I-Associated; and Vasoactive Intestinal Polypeptide.

The MeSH major subject headings were extracted from each MEDLINE citation that was matched to an OMIM reference in the example set. (These major subject headings were identified by NLM indexers as denoting the principal topics of an article.) For each OMIM entry in the example set, the occurrences of MeSH major subject headings in the MEDLINE citations that had been successfully mapped to OMIM references were counted. Since a particular MeSH major subject heading was likely to be used in more than one of these references, the total number of occurrences of MeSH major subject headings was almost always greater than the number of unique headings. In addition, the MEDLINE pointers to GENBANK information were extracted from the MEDLINE citations in the example set.

A HyperCard (Apple Computer, Cupertino, CA) presentation [3] was developed to illustrate the potential use of these computed linkages. A number of useful capabilities were demonstrated with the example set: browsing in a Meta-1 "semantic neighborhood", i.e., looking at concepts of probable relevance to a particular genetic disorder; conducting MEDLINE searches through a previously developed interface to Grateful Med [4]; and accessing relevant genetic sequence information from GENBANK [5].

* UNIX is a registered trademark of AT&T.

Results

There were 4,332 OMIM references with a year of publication later than 1987, meaning that they could possibly be found in the MEDLINE current file. Of these, 2,868 (66.3%) were successfully matched. There were 13,595 OMIM references with a year of publication later than 1982 and no later than 1987, meaning that they could possibly be matched to the two MEDLINE backfiles. Of these, 8,777 (64.6%) were successfully matched.

For the sake of brevity, results are presented only for one of the examples, Acyl CoA Dehydrogenase, Short-Chain, Deficiency of. The OMIM entry, translated into the format of the HyperCard application, is shown in Figure 1. The following ten unique MeSH major subject headings were extracted from the MEDLINE citations for this OMIM entry:

Fibroblasts
Muscles
DNA
Fatty Acid Desaturases
Lipid Metabolism, Inborn Errors
Muscular Diseases
Carnitine
Fatty Acids
Cloning, Molecular
Variation, Genetics

These headings occurred a total of 23 times in the references. Since all MeSH subject headings are core entries in Meta-1, each of the headings listed above provides a link to a Meta-1 concept that is in some way relevant to a genetic deficiency of Acyl CoA Dehydrogenase, a condition which causes muscle weakness due to defective lipid metabolism. By browsing in this "semantic neighborhood" within Meta-1, a user might discover further information of interest. For example, while Acyl CoA Dehydrogenase is not among the ten MeSH major subject headings listed above, Fatty Acid Desaturases is the only class of enzymes shown. By exploring its Meta-1 entry, the user would learn that Acyl CoA Dehydrogenases is a synonym for Fatty Acid Desaturases in MeSH.

Meta-1 concepts are assigned to UMLS semantic categories [6] as shown below for the current example. These may prove to be especially useful in suggesting strategies for browsing or database searches.

Anatomy: Fibroblasts; Muscles
Molecular Biology: DNA; Fatty Acid Desaturases
Diseases and Pathologic Processes: Lipid Metabolism, Inborn Errors; Muscular Diseases
Chemicals and Drugs: Carnitine; Fatty Acid Desaturases; Fatty Acids
Procedures: Cloning, Molecular
Concepts and Ideas: Variation, Genetics
Activities: Cloning, Molecular

Certain MEDLINE entries contain pointers to GENBANK sequence data that can be exploited as shown in Figure 2. These pointers permitted rapid and accurate retrieval of relevant genetic sequence information.
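The matching, example-selection, and heading-count steps described in Methods can be illustrated with a minimal sketch. This is not the INGRES query or the C programs used in the experiment; it is a Python analogue under stated assumptions. The record layout (dictionaries with the keys `mim`, `authors`, `journal`, `year`, and `major_mesh`) and all function names are hypothetical, and the comparison is an exact string match with no normalization, as in the experiment.

```python
from collections import Counter, defaultdict


def match_key(record):
    """Exact-match key: primary author, year of publication, journal name."""
    return (record["authors"][0], record["year"], record["journal"])


def match_references(omim_refs, medline_cits):
    """Map the position of each OMIM reference to the MEDLINE citations
    that share its key. References with no match are left out."""
    index = defaultdict(list)
    for cit in medline_cits:
        index[match_key(cit)].append(cit)
    return {
        i: index[match_key(ref)]
        for i, ref in enumerate(omim_refs)
        if match_key(ref) in index
    }


def select_examples(omim_refs, matches, min_matched=5, min_fraction=0.70):
    """OMIM entries with more than `min_matched` matched references and
    more than `min_fraction` of all their references matched."""
    total = Counter(ref["mim"] for ref in omim_refs)
    matched = Counter(omim_refs[i]["mim"] for i in matches)
    return sorted(
        mim for mim in total
        if matched[mim] > min_matched and matched[mim] > min_fraction * total[mim]
    )


def heading_counts(omim_refs, matches, mim):
    """Count occurrences of MeSH major headings over the citations
    matched to one OMIM entry (hypothetical `major_mesh` field)."""
    counts = Counter()
    for i, cits in matches.items():
        if omim_refs[i]["mim"] == mim:
            for cit in cits:
                counts.update(cit["major_mesh"])
    return counts
```

For a given entry, `len(counts)` gives the number of unique headings and `sum(counts.values())` the total number of occurrences, which is the distinction drawn between the ten unique headings and their 23 occurrences. Because the key has only three fields, two different articles by the same primary author in the same journal and year would both match; the sketch keeps every such citation rather than guessing which one is intended.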


[Figure 1. HyperCard screen showing the OMIM entry (MIM *201470), with a narrative pane, a reference pane, and the controls Find, Link To, Note, and Locate. Legible text of the entry follows.]

ACYL-CoA DEHYDROGENASE, SHORT-CHAIN, DEFICIENCY OF [ACADS DEFICIENCY; LIPID-STORAGE MYOPATHY SECONDARY TO SHORT-CHAIN ACYL-CoA DEHYDROGENASE DEFICIENCY; SCADH DEFICIENCY; ...]

Two distinct clinical phenotypes of hereditary short-chain acyl-CoA dehydrogenase (SCAD; EC 1.3.99.2) deficiency have been identified. One type has been observed in infants with acute acidosis and muscle weakness; the other has been observed in middle-aged patients with chronic myopathy. SCAD deficiency is generalized in the former type and localized to skeletal muscles in the latter. Turnbull et al. (1984) reported the case of a 53-year-old woman who presented with a lipid-storage myopathy and low concentrations of carnitine in ...

Figure 1

[Figure 2. HyperCard screen showing the GENBANK record linked to the OMIM entry ACYL-CoA DEHYDROGENASE, SHORT-CHAIN, DEFICIENCY OF (MIM *201470). Legible text of the record follows.]

LOCUS       HUMSCAD  1829 bp  ss-mRNA  PRI  15-DEC-...
DEFINITION  Human short chain acyl-CoA dehydrogenase mRNA, complete cds.
ACCESSION   M26393
KEYWORDS    short chain acyl-Coenzyme A dehydrogenase.
SOURCE      Human placenta, cDNA to mRNA, clones HS-[1,12].
ORGANISM    Homo sapiens
            Eukaryota; Animalia; Metazoa; Chordata; Vertebrata; Mamma...; Theria; Eutheria; Primates; Haplorhini; Catarrhini; Homin...
REFERENCE   1 (bases 1 to 1829)
AUTHORS     Naito,E., Ozasa,H., Ikeda,Y. and Tanaka,K.
TITLE       Molecular cloning and nucleotide sequence of complementar... encoding human short chain acyl-coenzyme A dehydrogenase ... study of the molecular basis of human short chain acyl-co... dehydrogenase deficiency
JOURNAL     J. Clin. ...

Figure 2

Discussion

The effect of mapping references in OMIM to MEDLINE and then extracting the MeSH major subject headings from the MEDLINE citations was to create a "genetic-disorder-centered view" of Meta-1 corresponding to a phenotype in OMIM. This "view" exploits the property of "semantic locality" in the Metathesaurus [7], where related meanings are linked together. The rationale for such a "genetic-disorder-centered view" is that it can help "filter out" the large number of Meta-1 concepts that would not be of interest to a clinical geneticist seeking information about a particular disorder. Thus, it addresses the well-known problem of "drinking from a fire hose," i.e., identifying what may be of genuine interest in a flood of information from innumerable journals and databases [8].

Additionally, "genetic-disorder-centered views" of the Metathesaurus may facilitate the development of "profiles of interest" or "fingerprints" of the types of articles that would be of interest to a clinical geneticist. The development of a program which could say that a given article fits such a "profile" might be particularly useful.

Although the methods and results of this experiment have been presented in terms of going from OMIM to MEDLINE, Meta-1, and GENBANK, the process can also work in other directions. The links, once established, are bidirectional. For example, let us assume that all the MeSH major subject headings have been extracted for references that match in OMIM and MEDLINE. As a result, many "semantic neighborhoods" within Meta-1 would be defined. A user browsing in Meta-1 might visit a particular "semantic neighborhood," find that there is an OMIM entry corresponding to a Meta-1 concept of interest, and then proceed to elicit the relevant genetic sequence information from GENBANK.
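The bidirectional use of the links can be sketched as an inverted index. The following Python fragment is an illustration only, not part of the HyperCard application; the function names and the two input mappings (OMIM number to MeSH major headings, OMIM number to GENBANK accession numbers) are assumptions about how the computed linkages might be held in memory.

```python
from collections import defaultdict


def build_links(entry_headings):
    """Invert a mapping of OMIM number -> MeSH major headings so that
    each heading points back to the OMIM entries it was derived from."""
    heading_to_entries = defaultdict(set)
    for mim, headings in entry_headings.items():
        for heading in headings:
            heading_to_entries[heading].add(mim)
    return dict(heading_to_entries)


def sequences_for_heading(heading, heading_to_entries, entry_accessions):
    """Walk heading -> OMIM entries -> GENBANK accession numbers.
    Entries with no known accession map to an empty list."""
    return {
        mim: sorted(entry_accessions.get(mim, ()))
        for mim in sorted(heading_to_entries.get(heading, ()))
    }
```

With the example from this paper, a user who arrives at the heading Carnitine would be led to OMIM entry 201470 and from there to GENBANK accession M26393, provided both mappings had been populated from the matched citations.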
It should be emphasized that this experiment took textual material that has a regular, almost formal, structure (the OMIM references) and matched it to a database that has a formal schema (MEDLINE). The success rate of approximately 65% was actually quite good, in view of the simple techniques that were used (i.e., searching for exact matches on primary author, date of publication, and journal name). Although no further analysis was performed on the OMIM references that did not match to MEDLINE, inspection of the data suggested that many of these references tended to be from personal communications or journals not indexed by MEDLINE.

An important lesson to be gleaned from this experiment is that inter-connectivity is greatly facilitated when links are stored explicitly in a database, as the pointers to GENBANK are stored in MEDLINE. If the OMIM references explicitly stored the MEDLINE citation identification number, then the lexical mapping steps in this experiment would have been completely unnecessary. A program could easily harvest these MEDLINE citation numbers algorithmically and then process the pointers to GENBANK from the appropriate MEDLINE field.
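A minimal sketch of such a harvesting program is given here, assuming that each OMIM reference carried a stored MEDLINE citation number. Everything about the record layout is hypothetical: the key names `medline_ui` and `secondary_ids`, and the pointer syntax `GENBANK/<accession>`, are illustrative assumptions and not a description of the actual OMIM or MEDLINE formats.

```python
def genbank_accessions(omim_refs, medline_by_ui, prefix="GENBANK/"):
    """Follow stored MEDLINE citation numbers to GENBANK pointers.
    No lexical matching is needed: each step is a direct lookup."""
    accessions = []
    for ref in omim_refs:
        ui = ref.get("medline_ui")        # hypothetical explicit link in OMIM
        citation = medline_by_ui.get(ui)
        if citation is None:
            continue                      # reference has no stored link
        for pointer in citation.get("secondary_ids", []):
            if pointer.startswith(prefix):
                accession = pointer[len(prefix):]
                if accession not in accessions:
                    accessions.append(accession)
    return accessions
```

The contrast with the lexical approach is that nothing here can fail to match because of a variant spelling of an author or journal name; a reference either carries the identifier or it does not.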

Conclusion

This experiment demonstrated an effective method for linking disparate biomedical databases, even when those databases were not conceived with inter-connectivity in mind. The use of bibliographic citation indices made it possible to compute linkages across databases by using simple lexical techniques and a commercial relational database management system. This approach merits further investigation on a larger scale.

References

[1] McKusick VA: Mendelian Inheritance in Man: Catalogs of Autosomal Dominant, Autosomal Recessive, and X-Linked Phenotypes. Ninth Edition. The Johns Hopkins University Press, Baltimore and London, 1990.

[2] Prettyman M: National Library of Medicine, personal communication.

[3] Abarbanel RM, Tuttle MS, Sherertz DD, Olson NE, Fuller LF, Nelson SJ: "Reaching out from Meta-1: Linking to Mendelian Inheritance in Man." In: Mitchell JA (ed), Proceedings of the American Medical Informatics Association, First Annual Educational and Research Conference. AMIA: Washington, DC. June 1990, p. 64.

[4] Grateful Med User's Guide, U.S. Department of Health and Human Services, National Institutes of Health, National Library of Medicine.

[5] GenBank On-line Service User Manual, IntelliGenetics Inc., 700 East El Camino Real, Mountain View, CA.

[6] McCray AT, Hole WT: "The Scope and Structure of the First Version of the UMLS Semantic Network." In: Miller RA (ed), Proceedings of the Fourteenth Annual Symposium on Computer Applications in Medical Care. Los Alamitos: IEEE, 1990:126-30.

[7] Nelson SJ, Tuttle MS, Cole WG, Sherertz DD, Sperzel WD, Erlbaum MS, Fuller LF, Olson NE: "From Meaning to Term: Semantic Locality in the UMLS Metathesaurus." To appear in the proceedings of SCAMC '91.

[8] Waldrop MM: "Learning to Drink from a Fire Hose." Science 1990 May 11; 248: 674-5.