Methodology For Creating a Sample Subset of Dynamic Taxonomy to Use in Navigating Medical Text Databases
Dennis Wollersheim, Wenny Rahayu
Dept. of Computer Science and Computer Engineering, La Trobe University, Bundoora, Victoria 3086, Australia
email: {dewoller, wenny}@cs.latrobe.edu.au
Abstract
The amount of text available in electronic form is increasing, especially since the rise of the web. So too are the potential interconnections between concepts, given the advent of ontologies and other relationship based data sources. Text could be navigated using the structure from the ontologies, specifically, using dynamic taxonomies to navigate the is-a relationships. Dynamic taxonomies are rooted index structures that dynamically prune themselves in response to zoom requests. The use of dynamic taxonomies with existing ontologies, and in the medical field, is unexplored. This paper details the process of connecting index terms from a medical text database to a taxonomy extracted from an existing medical ontology.
1. Introduction
There exist large collections of text, and the number and volume of such collections are growing. Additionally, the number of computer usable interconnections between the concepts in the text is growing, due to the development of ontologies. This work explores the use of ontologies for information retrieval. An ontology is a set of concepts, linked by relationships. In this paper we use a particular kind of ontology, a taxonomy. A taxonomy is a set of concepts linked by IS-A type relationships. Text has traditionally been navigated through indexes, tables of contents, and browsing. As the text becomes larger, browsing becomes problematic, and even the other methods become unmanageable; as the text size grows, so too grow the indexes.
1.1. Information retrieval by browsing
Browsing, the retrieval of information by recognition rather than recall, has many advantages. The foremost of these is that human memory is better at recognising items than remembering them without cues [1]. One way to increase recall in a text browsing environment is to increase the number of ways to gain access to a given piece of data, in other words, to increase the number of entry points into a text. Examples of this include deepening the detail of a table of contents, or increasing the number of index entries. The latter could be readily accomplished by marrying existing index terms with entries from a taxonomy. The problem with this strategy is that the utility of browsing declines as the number of terms increases, unless there are subsequent ways of ordering or reducing the term set.
Sacco uses dynamic taxonomy (DT) [2] as a solution to this problem of term overload. Firstly, DT provides standard taxonomic ordering of the index term set, allowing complexity to be hidden under higher level terms. Secondly, it provides a zoom operator, which can be used to reduce the term set. The zoom operation functions in the following manner. In response to a request to zoom in on a term, the system dynamically prunes the taxonomy, retaining only the taxonomic elements which categorise the set of text atoms which are also categorised by the zoom term. For example, Figure 1 shows a sample taxonomy derived from the medical field. Figure 2 shows the pruned taxonomy after a zoom on the concept amoxycillin. In this example, if a text atom is not classified by amoxycillin, it is not included in the pruned taxonomy. More importantly, any terms in the taxonomy that do not either directly or indirectly classify the remaining text atoms are also pruned.
The original DT work retrieved documents from a database of discrete newspaper articles. The present paper describes work based on a medical therapy encyclopaedia, retrieving small chunks of text from a continuous stream of highly interrelated content.
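The zoom operation described above can be sketched in a few lines of Python. This is a minimal illustration, not Sacco's implementation: the concept and text atom labels are taken from Figure 1, but the parent-child links and the classification links between concepts and atoms are assumptions made for the example, as are the function names `deep_extension` and `zoom`. The sketch also assumes the taxonomy contains no cycles.

```python
# Hypothetical taxonomy: parent concept -> child concepts (structure assumed).
CHILDREN = {
    "Medical concept taxonomy": ["Drugs", "Pathogens", "Conditions"],
    "Drugs": ["amoxycillin", "procaine penicillin", "vancomycin"],
    "Pathogens": ["Streptococcus pneumoniae", "Enterococcus faecalis"],
    "Conditions": ["Conjunctivitis", "Meningitis"],
}

# Hypothetical classification: concept -> text atoms it directly classifies.
CLASSIFIED = {
    "amoxycillin": {"Acute cholecystitis", "Gonococcal conjunctivitis"},
    "procaine penicillin": {"Gonococcal conjunctivitis"},
    "vancomycin": {"Hospital acquired meningitis"},
    "Enterococcus faecalis": {"Acute cholecystitis"},
    "Streptococcus pneumoniae": {"Hospital acquired meningitis"},
    "Conjunctivitis": {"Gonococcal conjunctivitis"},
    "Meningitis": {"Hospital acquired meningitis",
                   "Meningitis : Haemophilus influenzae type b"},
}


def deep_extension(children, classified, concept):
    """Atoms classified by the concept itself or by any of its descendants."""
    atoms = set(classified.get(concept, ()))
    for child in children.get(concept, ()):
        atoms |= deep_extension(children, classified, child)
    return atoms


def zoom(children, classified, root, focus):
    """Return (concepts kept, atoms kept) after zooming on `focus`."""
    focus_atoms = deep_extension(children, classified, focus)
    kept = set()

    def visit(concept):
        # Keep a concept only if it still classifies a remaining atom,
        # either directly or through a descendant.
        if deep_extension(children, classified, concept) & focus_atoms:
            kept.add(concept)
            for child in children.get(concept, ()):
                visit(child)

    visit(root)
    return kept, focus_atoms


kept, kept_atoms = zoom(CHILDREN, CLASSIFIED,
                        "Medical concept taxonomy", "amoxycillin")
print(sorted(kept))
print(sorted(kept_atoms))
```

With these assumed links, zooming on amoxycillin drops vancomycin, Streptococcus pneumoniae and Meningitis, because none of them classifies an atom that amoxycillin also classifies.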
Because of its complexity and interest, much text processing work has been done in the field of medical text. There are many medical term sets. Where Sacco used a hand crafted, purpose-built taxonomy, our taxonomic source is extracted from a pre-existing medical ontology, the Unified Medical Language System (UMLS, 2001 version) [3,4,5]. In DT, document index terms were manually assigned. Application of UMLS to a text base would be facilitated if the connection between UMLS and the text base could be computer generated. This paper uses shallow natural language processing techniques to assign concepts from UMLS. This technique has been explored by others (for example, see [6, 7]).

Figure 1 Example taxonomy, and subsequently classified text atoms. Solid lines show taxonomic is-a links; dotted lines denote classification by a taxonomic term. Labels shown: Medical concept taxonomy; Pathogens; Drugs; Conditions; amoxycillin; procaine penicillin; vancomycin; Streptococcus pneumoniae; Enterococcus faecalis; Conjunctivitis; Meningitis; Acute cholecystitis; Hospital acquired meningitis; Gonococcal conjunctivitis; Meningitis : Haemophilus influenzae type b.
Figure 2 Remaining taxonomy after a zoom on the concept amoxycillin. Atoms that are not classified by amoxycillin are pruned; also pruned are the sections of the taxonomic tree that would be unused. Labels shown: Medical concept taxonomy; Pathogens; Drugs; Conditions; amoxycillin; procaine penicillin; Conjunctivitis; Enterococcus faecalis; Acute cholecystitis; Gonococcal conjunctivitis.

Browsing alone can be cumbersome, as web surfing testifies. In practice, browsing is often combined with a query. For example, when using a reference book, we often look up a primary index word, then scan sub headings. The work in this paper will be used in a system that models this synergy. It will use query to narrow selection, and browse to traverse a DT generated hierarchical semantic categorisation of results. Similarly, in Dynacat [8, 9] medical documents are categorised using UMLS, queries are mapped to the medical domain, and results are browsed taxonomically. Importantly, Dynacat was found to be significantly more useful in satisfying user information requests than either cluster or ranking tools. Dynamic taxonomy improves on Dynacat. It is more flexible, as the taxonomy can be dynamically regenerated in response to zoom or other queries.
The medical field is highly complex, and the amount of available information is increasing rapidly. The democratisation of information brought about by the web has impacted medicine in particular, due to its very personal nature. The audience for this particular material is expanding and becoming less professional, which suggests a growing demand for alternative access methods.
1.2. Existing Medical Information
The specific medical text we will use is taken from the Dermatology edition of the Therapeutic Guidelines Limited (TGL) series of electronic medical handbooks [10]. As there are 10 such titles, and the amount of information is increasing, TGL seeks to increase information retrieval capability. The use of computer implemented guidelines has been investigated for this purpose, but more fundamental information retrieval techniques were found to be needed [11, 12]. The increased accessibility offered by dynamic taxonomy addresses this need.
For the taxonomy we use an extraction from the UMLS. UMLS is a project of the US National Institutes of Health (through its National Library of Medicine), originally built in order to unify medical term sets. It consists of two main sections: the Metathesaurus, and the Semantic Network. The Metathesaurus is an ontology of medical knowledge [13]. A key constituent of the Metathesaurus is a concept, which serves as a nexus of terms across the different term sets. Concepts are interconnected by a set of relationships, derived for the most part from those that linked the concepts in the original term sets. The Semantic Network is a systematic categorisation of the concepts by semantic type. For example, the Metathesaurus concept Dyshidrotic Eczema has the semantic type Disease or Syndrome. It is called a network because it too has structure via a set of relationships. For example, the semantic types Mental or Behavioral Dysfunction (mind) and Neoplastic Process (body) both have IS-A relationships with the semantic type Disease or Syndrome.
UMLS is an amalgamation of more than 50 term sets; this breadth makes it strong in the area of synonymy, and useful for natural language based approaches. On the other hand, the term sets were generally created for human controlled indexing activities and were not designed to be computer usable; this has led to less than rigorous construction.
This construction process leads to problems. UMLS has greatly varying granularity, is not rigorously defined or systematic, and is often inconsistent within and across term sets [7]. As a result, we must develop strategies to deal with specific problems that arise when extracting a taxonomy. UMLS has 8 major relationship types and 212 subtypes. While these relationships may have been rigorously defined in the documentation of their original term sets, upon amalgamation into UMLS, all the user is left with is the relationship specifier and the concepts that it connects. This is definition by example. Some of the relationships have a well defined intrinsic meaning (e.g. IS-A, PART-OF), while others are quite nonspecific (e.g. OTHER).
Strictly speaking, the relationships in a taxonomy are IS-A relationships. This means that a concept at a lower level is a more specific type of a concept at a higher level. The problem is that, while UMLS does contain information about IS-A relationships between concepts, they are few, comprising less than 2% of the total number of relationships. Because of the paucity of strict IS-A relationships, we choose to use a less well defined but more numerous set of broader type relationships, namely parent (PAR) and relation-broader (RB). These comprise a set of relationships similar to those specified by IS-A, and more importantly, make up greater than 15% of the total number of relations in UMLS.
The variety of term sets that make up UMLS means that relationships are defined in a variety of ways. The decision to broaden our inclusion criteria for the relationships that make up our taxonomy means that the relationships we extract are more often internally inconsistent. For example, PAR and RB relations between concepts are often cyclic (e.g. a PAR b and b PAR a), which makes building a taxonomy problematic. Although UMLS was not rigorously defined, some of its constituent term sets are more comprehensive and rigorous than others.
To limit the problem of cyclical relationships when building our taxonomy, we use only relations from two of the largest term sets: SNOMED (Systematized Nomenclature of Human and Veterinary Medicine), and MESH (Medical Subject Headings). Even with this restricted set of data sources, not all problems are avoided. SNOMED in and of itself has cycles, as can be seen in Figure 3.
UMLS is a potentially valuable resource, but we do not yet know how to use it with DT. More generally, we do not know how it can be used to facilitate medical information retrieval. UMLS is a large dataset, containing more than 700,000 concepts and 9,000,000 relationships. This size necessitates the creation of a sample subset of the document base and the connected UMLS concept-relation hierarchy.
Figure 3 Cycle of parent links within SNOMED. The arrow points towards the child node. Node labels: Endocrine Diseases; Thyroid Diseases; General and Polyglandular Endocrine Disorders; General and Growth Related Disorders.
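The relation selection described above (keep only PAR and RB relations, and only those contributed by SNOMED or MESH) can be sketched as a simple filter. The record layout `Relation(child, rel, parent, source)`, the function names, and the sample rows with placeholder concepts a, b and c are all hypothetical; the real UMLS relation files are laid out differently. The second function flags the simplest kind of cycle mentioned in the text, a pair of concepts linked in both directions.

```python
from collections import namedtuple

# Hypothetical record layout, used only for this illustration.
Relation = namedtuple("Relation", "child rel parent source")

BROADER_RELS = {"PAR", "RB"}          # broader-type relationships used
TRUSTED_SOURCES = {"SNOMED", "MESH"}  # term sets the links are taken from


def select_taxonomy_links(relations):
    """Keep (child, parent) pairs for broader-type links from trusted sources."""
    return [
        (r.child, r.parent)
        for r in relations
        if r.rel in BROADER_RELS and r.source in TRUSTED_SOURCES
    ]


def two_node_cycles(links):
    """Return the pairs of concepts that are linked in both directions."""
    link_set = set(links)
    return {
        frozenset((child, parent))
        for child, parent in link_set
        if child != parent and (parent, child) in link_set
    }


# Placeholder rows: a and b are each other's parent, as in "a PAR b and b PAR a".
rows = [
    Relation("a", "PAR", "b", "SNOMED"),
    Relation("b", "PAR", "a", "SNOMED"),
    Relation("c", "RB", "a", "MESH"),
    Relation("c", "OTHER", "a", "MESH"),         # dropped: not a broader-type link
    Relation("c", "PAR", "b", "OTHER_TERMSET"),  # dropped: untrusted source
]

links = select_taxonomy_links(rows)
print(links)
print(two_node_cycles(links))
```

Longer cycles, such as the one in Figure 3, are not caught by the pairwise check and need a traversal of the parent links.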
The sample subset will be used 1) as a basis for a prototype DT medical text browser, 2) to judge the veracity of the methodology, and 3) as raw material upon which to tune the DT zoom algorithm. Lessons learned here will be used more widely in further synthesis of ontologies and text databases.
The subset was built in the following way. First, medical reference text was taken from the TGL medical therapy encyclopaedia. Noun phrases were then extracted to generate a list of potential index phrases. The phrases were normalised to create a standard form that could then be used to match against UMLS strings which had been similarly normalised. Finally, a taxonomy was grown from the SNOMED and MESH derived relations in UMLS, based on the UMLS concepts found in the previous step. Figure 4 diagrammatically describes the process.

Figure 4 Workflow diagram showing processes followed; input is medical text, output is a taxonomy that relates to it.
1) Raw material: eTG: Dermatology, 1st Ed., Therapeutic Guidelines Ltd. Action: select sample medical text. Example result: Pompholyx, an acute and severe vesicular dermatitis of the hands or feet, often needs systemic corticosteroids.
2) Tool: Brill Part-of-Speech Tagger. Action: generate possible index phrases; mark and extract noun phrases. Example results: Pompholyx; acute; severe vesicular dermatitis; hands; feet; systemic corticosteroids.
3) Tool: UMLS text normalisation routines. Action: regularise phrases; stem and place in alphabetic order. Example results: pompholyx; acute; dermatitis severe vesicular; hand; foot; corticosteroid systemic.
4) Raw material: UMLS concepts, all term sets. Action: match phrases to UMLS concepts. Example results (term: UMLS concept found): pompholyx: Dyshidrotic Eczema; acute: Acute; hand: Hand; foot: Pes; foot: feet, unit of measurement; foot: foot problem.
5) Raw material: UMLS relations, SNOMED and MESH relations only. Action: grow sample taxonomy from concepts found. Example result: see figure for example taxonomy grown.
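The phrase-handling steps of the workflow (extract noun phrases, normalise them, match them exactly against concept strings) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the part-of-speech tags are written by hand in Penn Treebank style rather than produced by the Brill tagger, the `STEMS` dictionary is a tiny stand-in for the UMLS normalisation routines, and `CONCEPT_INDEX` holds only the matches listed in Figure 4, so a phrase missing from it says nothing about UMLS itself. All function and variable names are hypothetical.

```python
NOUN_OR_ADJ = ("NN", "JJ")  # prefixes of Penn Treebank noun and adjective tags


def candidate_phrases(tagged_tokens):
    """Return maximal runs of consecutive nouns and adjectives."""
    phrases, run = [], []
    for word, tag in tagged_tokens:
        if tag.startswith(NOUN_OR_ADJ):
            run.append(word)
        elif run:
            phrases.append(run)
            run = []
    if run:
        phrases.append(run)
    return phrases


def normalise(words, stems):
    """Lower-case and stem each word, then sort the words within the phrase."""
    return " ".join(sorted(stems.get(w.lower(), w.lower()) for w in words))


def match_concepts(phrases, stems, concept_index):
    """Map each normalised phrase to the concepts whose string matches exactly."""
    matches = {}
    for words in phrases:
        key = normalise(words, stems)
        if key in concept_index:
            matches[key] = concept_index[key]
    return matches


# Hand-tagged example paragraph (tags are an assumption, not tagger output).
TAGGED = [
    ("Pompholyx", "NN"), (",", ","), ("an", "DT"), ("acute", "JJ"),
    ("and", "CC"), ("severe", "JJ"), ("vesicular", "JJ"),
    ("dermatitis", "NN"), ("of", "IN"), ("the", "DT"), ("hands", "NNS"),
    ("or", "CC"), ("feet", "NNS"), (",", ","), ("often", "RB"),
    ("needs", "VBZ"), ("systemic", "JJ"), ("corticosteroids", "NNS"),
    (".", "."),
]

# Stand-in for the UMLS dictionary assisted suffix trimmer.
STEMS = {"hands": "hand", "feet": "foot", "corticosteroids": "corticosteroid"}

# Stand-in for the normalised master list of UMLS strings (from Figure 4).
CONCEPT_INDEX = {
    "pompholyx": ["Dyshidrotic Eczema"],
    "acute": ["Acute"],
    "hand": ["Hand"],
    "foot": ["Pes", "Feet, unit of measurement", "Foot problem"],
}

phrases = candidate_phrases(TAGGED)
normalised = [normalise(p, STEMS) for p in phrases]
matches = match_concepts(phrases, STEMS, CONCEPT_INDEX)
print(normalised)
print(matches)
```

Sorting the words inside each phrase means that word order no longer affects matching, which is why systemic corticosteroids becomes corticosteroid systemic.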
2. Methodology for Creating a Subset Dynamic Taxonomy
A dynamic taxonomy needs a set of documents which are multiply classified by topics embedded in a taxonomy. This paper describes a methodology for the construction of such a structure. It consists of 5 steps: 1) choose a subset of medical text to work with, 2) generate possible index phrases, 3) stem and order phrases, 4) match phrases to UMLS concepts, and 5) grow a taxonomy from the found concepts.
The choice of medical text to work with was arbitrary, but had some rationale. We chose the 6 sections covering the general topic of dermatitis from eTG: Dermatology because:
• the content is related, which will help test the precision of DT, i.e. evaluate the usefulness of using DT for fine grained distinction, and
• there are many index words within each section, and as a result, the resulting taxonomy is dense.
The sections were taken from the electronic version of the book, which is more atomised than the paper book. This is because disconnection from a linear book sequence necessitates sections that can stand alone. The text was divided into paragraphs, for later retrieval purposes. This division was done on a purely lexical level, with paragraphs being denoted as strings separated by 1 or more new line characters. An example paragraph is:
Pompholyx, an acute and severe vesicular dermatitis of the hands or feet, often needs systemic corticosteroids. [10]
The next step was to find phrases in the text that could be used as index terms. This was done by first tagging the parts of speech in the text using the Brill tagger [14]. Then, possible index phrases were extracted by selecting maximum length consecutive groups of nouns and adjectives. The possible index phrases extracted from the above example include feet, severe vesicular dermatitis, and systemic corticosteroids.
The resulting words were converted to a standard form using the UMLS provided dictionary assisted suffix trimmer, and then sorted into alphabetic order within the phrase. This process converts feet to foot, and systemic corticosteroids to corticosteroid systemic.
The normalised phrases were then matched against a similarly processed master list of phrases representing all the concepts in UMLS. Concepts with exact matches were said to exist in the original text. Other work [7] used a technique of looking for increasingly smaller portions of the phrase until a match was found, but our more restrictive condition found an adequate number of index concepts. In the example, pompholyx matched the concept Dyshidrotic Eczema, while foot matched 3 concepts: the feet on the human body, problems of those feet, and the unit of measurement.
At this point, we have a list of UMLS concepts relating to a set of text documents. To be used in dynamic taxonomy, these concepts need to be embedded in a taxonomy. The algorithm used to build the taxonomy is as follows. For each concept, repeatedly add any new PAR or RB link and related concept where the target concept is of the same semantic type as the source concept. It is necessary to eliminate cycles during this process. The first link detected that would cause a cycle is marked as bad, barring it from being chosen for this path, or as part of any other path. Figure 5 shows a more formal explanation of the algorithm.

    Extract all links that have as children the matched concepts; make them level 0 relations
    level = 0
    while more_relations_to_extract
        level = level + 1
        for each concept C(level-1)
            for each concept C(level) that is a parent to C(level-1)
                if C(level) is already in the chain C(0)..C(level-1)
                    mark link C(level-1)..C(level) as bad, to exclude it from further consideration (to eliminate loops)
                else if the link C(level-1)..C(level) has not already been extracted
                    extract it
                    extract concept C(level)
            loop
        loop
    loop
Figure 5 Algorithm for generating taxonomy without cycles.

The end result of the above algorithm is a set of separate hierarchies, because not all concepts have a common ancestor. We go on to use the UMLS semantic network to provide a single consistent root. As there are no cycles in the semantic network, this extraction algorithm is simpler: for each parentless node in the forest, find its semantic type. Then, from the semantic network, follow the IS-A relationships up to the root, extracting each relation traversed. Figure 6 shows the result of this process, a hierarchy consisting of some of the ancestors of the concept Dyshidrotic Eczema.
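The cycle elimination of Figure 5 can be sketched in Python. Note the substitution: rather than the level-by-level loop of the figure, this sketch uses a depth-first climb from each matched concept, treating any link whose parent is already on the current chain as bad, which guarantees that the kept links contain no cycle. The function name `grow_taxonomy`, the input layout (a dictionary from concept to its PAR or RB parents) and the placeholder concepts are assumptions, and the same-semantic-type condition is left out for brevity.

```python
def grow_taxonomy(parent_links, matched):
    """Climb parent links from the matched concepts, dropping loop-closing links.

    parent_links: dict mapping a concept to the list of its parent concepts.
    matched: concepts found in the text, used as starting points.
    Returns (kept, bad), two sets of (child, parent) links.
    """
    kept, bad = set(), set()
    on_chain = set()  # concepts on the chain currently being climbed
    done = set()      # concepts whose parents have all been explored

    def climb(concept):
        on_chain.add(concept)
        for parent in parent_links.get(concept, ()):
            if parent in on_chain:
                # The parent is already in the chain: this link would close a loop.
                bad.add((concept, parent))
                continue
            kept.add((concept, parent))
            if parent not in done:
                climb(parent)
        on_chain.discard(concept)
        done.add(concept)

    for concept in sorted(matched):
        if concept not in done:
            climb(concept)
    return kept, bad


# Placeholder concepts: a and b are each other's parent, top is a true ancestor.
PARENT_LINKS = {
    "x": ["a"],
    "a": ["b"],
    "b": ["a", "top"],
}

kept, bad = grow_taxonomy(PARENT_LINKS, {"x"})
print(sorted(kept))
print(sorted(bad))
```

The recursion is fine for a small sample like the one described here; a very deep hierarchy would call for an explicit stack instead.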
Due to the level of detail in UMLS and the fact that we work with a limited sample of medical text, the resulting hierarchy is narrow and deep, often having only one child per parent. We ameliorate this by promoting grandchildren which are children of sole children, eliminating the only child node. This has the advantage of making the hierarchy shallower but not wider.
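The promotion of grandchildren can be sketched as below. This is a minimal illustration which assumes the hierarchy is a tree stored as a dictionary from each node to the list of its children; the function name and the placeholder node names are hypothetical. The source does not say how a removed only child that itself classifies text is handled, so the sketch ignores that case.

```python
def promote_grandchildren(children, root):
    """Return a copy of the tree with childbearing only children removed."""
    tree = {node: list(kids) for node, kids in children.items()}
    stack = [root]
    while stack:
        node = stack.pop()
        kids = tree.get(node, [])
        # While this node has a single child that itself has children,
        # remove that child and attach its children directly to this node.
        while len(kids) == 1 and tree.get(kids[0]):
            only_child = kids[0]
            kids = tree.pop(only_child)
            tree[node] = kids
        stack.extend(kids)
    return tree


# Placeholder hierarchy: root -> a -> (b, c), b -> d -> e.
TREE = {"root": ["a"], "a": ["b", "c"], "b": ["d"], "d": ["e"]}

flattened = promote_grandchildren(TREE, "root")
print(flattened)
```

A sole child with no children of its own, such as e under b in the result, is kept, since removing it would discard a leaf concept.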
3. Implementation of the Subset Creation Methodology
The above methodology has been implemented using an Oracle database. First, the UMLS dataset was imported into Oracle. The Therapeutic Guidelines content, originally in HTML form, was converted to raw text. Noun phrases were extracted using a Linux based Brill tagging program, stemmed, and then matched against the UMLS master phrase list. From the list of matches, a taxonomy was extracted from UMLS using Perl scripts and Oracle SQL queries.

Figure 6 Sample of extracted taxonomy relating to the concept Dyshidrotic Eczema. Child nodes are lower in the tree. Node labels: Skin Diseases; Disease of skin and subcutaneous tissue not otherwise specified; Vesiculobullous Diseases; Skin Diseases Vesiculobullous; Skin Appendage Diseases; Diseases of the Epidermal Appendages; Sweat Gland Diseases; Dermatitis; Skin Diseases Eczematous; Noninfectious Vesicular and Bullous Diseases; Eczema; Dyshidrotic Eczema (pompholyx).

Figure 7 Partial taxonomy, consisting of parents of concepts derived from the strings 'hand' and 'foot' in the text. The C number is the concept id from the UMLS Metathesaurus; the T number is the semantic node id from the Semantic Network. Node labels: Entity T071; Physical Object T072; Anatomical Structure T017; Fully Formed Anatomical Structure T021; Body Part Organ or Organ Component T023; Hand C0018563; Conceptual Entity T077; Idea or Concept T078; Finding T033; Hand problem C0741992; Foot problem C0555980; Quantitative Concept T081; Feet, unit of measurement C0347981; Spatial Concept T082; Body Location or Region T029; Body Regions C0005898; Trunk NOS C0205075; Extremities C0015385; Lower Extremity C0023216; Pes C0016504.

This process divided up the original 6 book sections into 146 paragraphs, from which 489 unique UMLS concepts were extracted. These concepts formed the base of a hierarchy which contained 1021 distinct concepts and semantic network nodes, joined by 1328 relationships. Figure 7 shows a subset of the extracted taxonomy. It consists of the parent nodes to the concepts that were
generated from the phrases hand and foot. Problems are apparent. Hand and foot are classified by UMLS editors into entirely different branches of the semantic network, defying common sense. These words are also ambiguous, each having multiple senses. Figure 8 shows a partially expanded treeview of the taxonomy. The level of detail retrieved by this method is apparent. This would overwhelm a real user. We must reduce the detail in some way.

Figure 8 Sample of taxonomy grown. The number in brackets after the term is the total number of unique paragraphs classified by either this term, or by the children of this term.

During this sample generation process, errors were found. These include:
• the noun phrase extraction process had low precision and recall, because of the specialised medical vocabulary,
• UMLS synonymy itself is a problem; words are sometimes misidentified into an incorrect context; for example, the string 'proven' matches a type of antivenom,
• even where the correct sense of the word is identified, the resulting terms sometimes would have little meaning in an index, e.g. 'agent' or 'week', and
• there is a problem with scope. The concepts are assumed to apply at paragraph level, but some terms apply more widely. For example, in this sample, the term 'dermatitis' applies everywhere.
4. Future Work
A weakness of this work is that it is difficult to judge the dynamic taxonomy that was created, as there is no gold standard by which to judge the precision and recall of the concept generation algorithm. A candidate for such a standard would be to match the current, manually assigned index terms to UMLS concepts. A solution to the problems with UMLS would be to use a more systematically constructed term set instead. SNOMED CT (Systematized Nomenclature of Medicine, Clinical Terms) [15], and GALEN (Generalised Architecture for Languages, Encyclopaedias and Nomenclatures in Medicine) [16] are two possibilities. While not as diverse as UMLS, they were both designed to be used in a more structured way, and as such, have the potential to be more systematically ordered.
5. Conclusion
Medicine is a complex field, as is evident from the scope of UMLS itself. The mere addition of 8 million relationships to a problem does not in and of itself make it less complex. We must also add order. Dynamic taxonomy provides a way to order a complex field. This work shows a reasonable methodology for generating a sample subset that joins a section of a text database to a taxonomy. The result will be useful for dynamic taxonomy.
6. Bibliography
[1] E. F. Mulhall, Experimental studies in recall and recognition, American Journal of Psychology, vol. 26, pp. 217-228, 1915.
[2] G. Sacco, Dynamic taxonomies: a model for large information bases, IEEE Transactions on Knowledge & Data Engineering, vol. 12, pp. 468-79, 2000.
[3] K. E. Campbell, D. E. Oliver, K. A. Spackman, and E. H. Shortliffe, Representing thoughts, words, and things in the UMLS, Journal of the American Medical Informatics Association, vol. 5, pp. 421-31, 1998.
[4] D. A. B. Lindberg, B. L. Humphreys, and A. T. McCray, The Unified Medical Language System, Meth. Inform. Med., vol. 32, pp. 281-291, 1993.
[5] A. T. McCray, A. R. Aronson, A. C. Browne, T. C. Rindflesch, A. Razi, and S. Srinivasan, UMLS Knowledge for Biomedical Language Processing, Bulletin of the Medical Library Association, vol. 81, pp. 184-194, 1993.
[6] A. R. Aronson, Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program, presented at American Medical Informatics Association Annual Symposium, Washington DC, 2001.
[7] P. Nadkarni, R. Chen, and C. Brandt, UMLS Concept Indexing for Production Databases: A Feasibility Study, Journal of American Medical Informatics Association, vol. 8, pp. 80-91, 2001.
[8] W. Pratt, Dynamic Categorization: A Method for Decreasing Information Overload, in Medical Information Sciences: Stanford University, 1999, pp. 171.
[9] W. Pratt and L. Fagan, The Usefulness of Dynamically Categorizing Search Results, Journal of American Medical Informatics Association, vol. 7, pp. 605-617, 2000.
[10] M. Stevens, eTG: Dermatology, 1 ed. Melbourne: Therapeutic Guidelines Ltd, 1999.
[11] D. Wollersheim, A Review of Decision Support Formats with Respect to Therapeutic Guidelines Limited Requirements, presented at Ninth National Health Informatics Conference, Canberra, ACT, Australia, 2001.
[12] D. Wollersheim and W. Rahayu, Implementation of Dynamic Taxonomies for Clinical Guideline Retrieval, presented at International Conference on Medical Informatics, Hyderabad, India, 2001.
[13] D. M. Pisanelli, A. Gangemi, and G. Steve, An Ontological Analysis of the UMLS Metathesaurus, presented at American Medical Informatics Association Annual Symposium, 1998.
[14] E. Brill, Some Advances In Rule-Based Part of Speech Tagging, AAAI, 1994.
[15] SNOMED, SNOMED International: SNOMED Clinical Terms (SNOMED CT), http://www.snomed.org/snomedct_txt.html, 2001.
[16] W. D. Solomon, C. J. Wroe, A. L. Rector, J. E. Rogers, J. L. Fistein, and P. Johnson, A Reference Terminology for Drugs, presented at American Medical Informatics Association Annual Symposium, 1999.