MEDINFO 2004 M. Fieschi et al. (Eds) Amsterdam: IOS Press © 2004 IMIA. All rights reserved
Knowledge Sharing and Information Integration in Healthcare using Ontologies and Deductive Databases

Fabiane Bizinella Nardon a,b, Lincoln A Moura a,c

a Escola Politécnica – University of São Paulo, Brazil
b Amazon Technologies do Brasil, Brazil
c Atech Foundation, Brazil
Abstract

This paper describes a method for using Semantic Web technologies to share knowledge in healthcare. It combines deductive databases and ontologies, so that it is possible to extract knowledge that has not been explicitly declared within the database. A representation of the UMLS (Unified Medical Language System) Semantic Network and Metathesaurus was created using the RDF standard, in order to represent the basic medical ontology. Inference over the knowledge base is performed by the TRI-DEDALO system, a deductive database created to query and update RDF-based knowledge sources as well as conventional relational databases. Finally, an ontology was created for the Brazilian National Health Card data interchange format, a standard for capturing and transmitting health encounter information throughout the country. This paper demonstrates how this approach can be used to integrate heterogeneous information and to answer complex queries in a real-world environment.

Keywords:
Knowledge sharing, information integration, deductive databases, Semantic Web, UMLS

Introduction

The representation of medical knowledge is not a trivial task, as it requires a highly expressive, platform-independent formalism that allows several scattered and poorly coordinated peers to communicate and exchange knowledge. There is also a need for a powerful query language able to express complex queries over complex data. Addressing a similar problem, namely the integration of heterogeneous information available on the Web, the Semantic Web initiative has in recent years created standards that aim at universal forms of knowledge representation. In this paper we use Semantic Web standards and concepts to pursue data integration and knowledge sharing in healthcare. Our approach is to represent the UMLS [1] Semantic Network and Metathesaurus using the RDF [8] and DAML+OIL [10] standards, using this representation as the basic ontology for medical concepts and a deductive database system for inferring and querying the knowledge base. As a test case, another ontology was created from the Brazilian National Health Card data interchange format, a standard for capturing and transmitting health encounter information throughout the country. Since Brazilian health providers will have to integrate their data with that standard, the National Health Card ontology can be the starting point for a huge distributed medical knowledge base.

Methods

Knowledge Representation and Sharing in Healthcare

Standards for medical knowledge representation and sharing have been a subject of research for many years. The best-known initiatives are the Arden Syntax standard [2], the Guidelines Interchange Format (GLIF) [3], and the Good Electronic Health Record (GEHR) Project [4]. Besides these knowledge representation standards, there are standards that aim at allowing data exchange among different institutions, like Health Level 7 (HL7) [5]. Despite the efforts towards a universal standard for the representation of medical knowledge, none of the proposed standards has reached a level of acceptance that would allow health information to be shared on a large scale.

There is a large number of medical vocabularies for coding information in healthcare. The Unified Medical Language System (UMLS) [1] is an effort of the U.S. National Library of Medicine aimed at facilitating the retrieval and integration of information from multiple biomedical information sources. UMLS is probably the most comprehensive ontology in healthcare, as it defines relationships among a large number of different vocabularies [6]. The current edition of the UMLS Metathesaurus includes 875,255 concepts and 2.14 million concept names in its source vocabularies. UMLS also defines a semantic network that provides a categorization of all the concepts represented in its Metathesaurus and the relationships among them.
Since the relationships among medical concepts are essential to answer complex queries and to find the correspondence among concepts that represent the same information but are coded by different vocabularies, UMLS can be used as the basic ontology for any medical knowledge base. If a particular concept is coded by one of the UMLS supported vocabularies, it is possible to find the code for this concept in the other supported vocabularies, thus integrating information stored using different coding schemes. Sometimes, however, a concept does not have a one-to-one correspondence in two different vocabularies: a code in one vocabulary may represent more specific information than a code in another vocabulary. The UMLS addresses that situation by specifying relationships such as “broader than” and “child of”.
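The cross-vocabulary lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the concept identifiers, vocabulary names and codes are hypothetical placeholders and do not come from the UMLS, and the `translate` function is not part of any UMLS tool. When no exact counterpart exists in the target vocabulary, the sketch climbs the "child of" relationship and reports that the returned code is broader than the original one.

```python
from collections import defaultdict

# Hypothetical data: (concept_id, vocabulary, code). These identifiers and
# codes are invented for illustration and are not real UMLS content.
CONCEPT_CODES = [
    ("CONCEPT_A", "VOCAB_1", "A10"),
    ("CONCEPT_A", "VOCAB_2", "x-77"),
    ("CONCEPT_B", "VOCAB_1", "A10.1"),
]

# Hypothetical hierarchy: child concept -> parent concept ("child of").
CHILD_OF = {"CONCEPT_B": "CONCEPT_A"}


def build_indexes(rows):
    """Index the rows by (vocabulary, code) and by concept."""
    by_code = {}
    by_concept = defaultdict(dict)
    for concept, vocab, code in rows:
        by_code[(vocab, code)] = concept
        by_concept[concept][vocab] = code
    return by_code, by_concept


def translate(vocab_from, code, vocab_to, rows=CONCEPT_CODES, child_of=CHILD_OF):
    """Return (code, exact) in vocab_to, or None if no mapping is found.

    exact is False when the code belongs to a broader ancestor concept.
    """
    by_code, by_concept = build_indexes(rows)
    concept = by_code.get((vocab_from, code))
    exact = True
    while concept is not None:
        target = by_concept[concept].get(vocab_to)
        if target is not None:
            return target, exact
        # No code for this concept in the target vocabulary: try its parent.
        concept = child_of.get(concept)
        exact = False
    return None
```

Returning the `exact` flag alongside the code keeps the distinction between an equivalent code and a merely broader one visible to the caller, which matters when the translated data is later used for inference.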
F. B. Nardon et al.
Besides the correspondence among concepts coded by different vocabularies, the UMLS, through its Semantic Network, provides additional relationships among them. For example, the relationship “part of” can occur between concepts that are instances of “Anatomical Structure” and “Organism”.

If a healthcare provider uses one of the UMLS supported vocabularies to code its data, it is possible, using the UMLS relationships, to infer new information not explicitly stored in the database (e.g., “liver cancer” is a kind of “cancer”, so if Peter has “liver cancer”, Peter has cancer). Also, if different information systems use different vocabularies, it is possible to find equivalences between concepts through the UMLS Metathesaurus.

The Semantic Web Standards for Knowledge Representation

The Semantic Web initiative [7] aims at adding machine-understandable semantics to the information available on the web, thus allowing for effective knowledge sharing and information integration. To achieve this goal, the Semantic Web created a standard for knowledge representation called Resource Description Framework (RDF) [8]. Knowledge in RDF is represented by statements (also known as triples). A statement is a declaration in the form (subject, predicate, object), meaning that subject has a property predicate whose value is object. For example, the statement (Peter, diagnosis, appendicitis) means that a subject Peter has a diagnosis of appendicitis. An RDF statement represents a directed graph, where the subject and object are nodes and the predicate is the arc. This simple form of knowledge representation has proved highly expressive and allows knowledge from every domain to be represented.

There are two features of RDF that make it more interesting than any knowledge representation standard ever proposed. The first feature is that every concept in RDF has a universal unique identifier, the Uniform Resource Identifier (URI). The URI standard was created at the core of the web and is used to uniquely identify e-mail addresses, Web pages (using a special kind of URI, the Uniform Resource Locator or URL), and so on. Since it is impossible in RDF to have two concepts identified by a single identifier, there is no semantic ambiguity in the Semantic Web. The second feature that makes RDF an interesting standard is its simple and straightforward way of expressing information. This feature makes machine processing of the knowledge base easier and makes the knowledge expressed in RDF understandable by humans.

RDF also defines a series of constructs that allow the creation of knowledge representation schemas. Among these constructs are rdfs:Class, for defining classes of concepts; rdfs:Datatype, for data type definitions; rdfs:range, which is used to state that the possible values of a particular predicate should be instances of a particular class; and several other generic constructs not associated with any particular domain.

The RDF standard does not have constructs for representing ontologies and, because of that, extensions to RDF have been created for this purpose. The most widely used extension to RDF for expressing ontologies is the DAML+OIL standard, formed by the DARPA Agent Markup Language [10] and by the Ontology Inference Layer [9]. The DAML+OIL standard is the basis of the Web Ontology Language (OWL) [11], which is being created by the W3 Consortium to be the Semantic Web standard for ontology representation.

Examples of concepts defined by the DAML+OIL standard are: daml:sameClassAs, which states that two classes with different URIs are actually the same class; daml:sameIndividualAs, which states that two resources with different URIs are the same resource; daml:UnambiguousProperty, used to express that if two individuals have the same value for a particular property, they can be considered the same individual; and daml:inverseOf, which states that two properties are the inverse of each other, making it possible to infer, for example, that if the “parent” property is the inverse of the “child” property and there is a statement defining that “Adam” is the parent of “Cain”, then “Cain” is the child of “Adam”.

Concepts defined by the RDF and DAML+OIL standards are just for information, i.e., there is no embedded mechanism in these standards that guarantees that a particular schema is being respected by an associated knowledge base. Besides that, there is not yet a standard query and inference language for RDF knowledge bases.

Inference on RDF Knowledge Bases Using Deductive Databases

In order to query an RDF knowledge base and infer embedded facts from it, a query language and an inference mechanism are necessary. There are several implementations of query languages that can extract information from RDF documents [12]. Some of those languages use deductive database technology to make inferences and to answer complex queries. Deductive Databases [13] are databases that provide all the services of a conventional Database Management System and, additionally, allow new information to be deduced from the data explicitly inserted in the database. Deduction of new facts is done by a set of deductive rules that are part of the database schema. The relations that contain the explicitly inserted facts are known as basic relations and the relations that contain the inferred facts are known as derived relations. Deductive Database Systems use a declarative logic-based query language in which recursive queries can be expressed.

Usually, deductive databases use a Prolog-like query language called Datalog. Datalog [14] is a declarative logic-based query language that does not have pre-defined predicates, negation, disjunction or function symbols. A rule in Datalog is evaluated through the derivation of the set of all possible constants that make the head of the rule become true. This set of constants is then assigned to the new derived relation represented by the head of the rule.

The advantage of using Deductive Databases instead of other Artificial Intelligence inference mechanisms is that the former were conceived to handle large databases. Other similar mechanisms, like Prolog, usually deal with a small base of facts. Since the databases existing in typical Semantic Web applications, and also in healthcare applications, are very large, it is important to have inference mechanisms that can cope with large amounts of data.
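As a rough illustration of how deductive rules derive new RDF statements, the following Python sketch applies three rules to a small set of triples until no new triple appears (a naive bottom-up evaluation). It is an assumption-laden toy, not the TRI-DEDALO engine or a Datalog implementation: the predicate names `isa`, `hasDiagnosis`, `inverseOf`, `parentOf` and `childOf`, as well as the function `infer`, are hypothetical names chosen for this example.

```python
# Basic relation: explicitly stated triples (subject, predicate, object).
# Predicate names are hypothetical and chosen only for this sketch.
BASE_FACTS = {
    ("liver cancer", "isa", "cancer"),
    ("cancer", "isa", "disease"),
    ("Peter", "hasDiagnosis", "liver cancer"),
    ("parentOf", "inverseOf", "childOf"),
    ("Adam", "parentOf", "Cain"),
}


def infer(triples):
    """Apply the rules repeatedly until a fixpoint is reached."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                # Rule 1: isa is transitive.
                if p == "isa" and p2 == "isa" and s2 == o:
                    new.add((s, "isa", o2))
                # Rule 2: a diagnosis also holds for every broader concept.
                if p == "hasDiagnosis" and p2 == "isa" and s2 == o:
                    new.add((s, "hasDiagnosis", o2))
                # Rule 3: inverse properties swap subject and object.
                if p == "inverseOf" and p2 == s:
                    new.add((o2, o, s2))
                if p == "inverseOf" and p2 == o:
                    new.add((o2, s, s2))
        if new <= facts:
            # Nothing new was derived: facts now holds the derived relation.
            return facts
        facts |= new


derived = infer(BASE_FACTS) - BASE_FACTS
```

The nested loop is quadratic in the number of facts on every pass, which is acceptable for a handful of triples but is exactly the kind of cost that a deductive database avoids by delegating joins to the underlying database engine.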
In order to provide a query and inference mechanism for RDF knowledge bases, we created the TRI-DEDALO (TRIples, DEduction, DAta and LOgic) [15] system, a deductive database that uses a Datalog extension as its query language. The TRI-DEDALO query language supports negation, aggregate functions, arithmetic operations, disjunction, comparison operations, update operations and fuzzy reasoning. All of these features are absent in Datalog. Additionally, the TRI-DEDALO notation allows relation attributes to be referenced by name rather than by position. It is also possible to specify an ordering for the query result set. The main goal of the TRI-DEDALO system is to build a deductive database that can be used in real-world applications, and the extensions added are essential to achieve this goal.

A complete description of the TRI-DEDALO system is out of the scope of this paper, but can be found elsewhere [15][16]. As far as this paper is concerned, it suffices to say that the TRI-DEDALO system has features that allow RDF statements, or triples, to be used as basic relations in the body of a rule. Statements can also be used in the head of a rule, which makes it possible to infer new statements. One of the features added to the TRI-DEDALO system in order to support RDF statements is the syntax to specify namespaces. A namespace is specified in the TRI-DEDALO language in the form:

    xmlns:name=uri.

where name is the namespace identifier and uri is the associated URI.

Another feature is the support of URIs as constants associated with attributes and also as identifiers for relations and attributes. A third feature is the existence of a special statement relation. This relation has three fixed attributes: subject, predicate and object, thus allowing the representation of RDF statements as a relation in TRI-DEDALO. A statement can be represented in a TRI-DEDALO rule in the form:

    statement(subject?arg1, predicate?arg2, object?arg3)

where statement, subject, predicate and object are reserved words representing, respectively, a statement and its subject, predicate and object; each of the argn is a variable, a URI, a constant or an arithmetic expression. It is possible to use stm, s, p and o as short forms of statement, subject, predicate and object. If the words subject, predicate, object or their short forms are used, the order in which they appear in the statement is not important and any of them can be omitted. It is also possible to use an implicit syntax in a statement: if the words subject, predicate and object are not present, TRI-DEDALO will assume that the first argument represents the subject, the second represents the predicate and the third represents the object.

The TRI-DEDALO system is open source software implemented in the Java programming language. TRI-DEDALO acts as an additional layer on top of any relational database. The sentences expressed in the TRI-DEDALO language are translated into a set of SQL statements that are submitted to the underlying database by the TRI-DEDALO server. A JDBC driver is available to allow clients to connect to the TRI-DEDALO server. Using this approach, any application that supports a JDBC driver can connect to the TRI-DEDALO system, including many Java tools and even powerful enterprise architectures such as Enterprise Java Beans.

For example, the rule:

    ControlledDrugs(Medicine:X) :- stm(s?X, p?, o?), not(stm(s?X, p?, "liberated")).

would be translated into the following SQL statement:

    SELECT DISTINCT a0.subject AS Medicine
    FROM statements a0
    WHERE a0.predicate = 'http://www.samplehospital.com.br#isa'
      AND a0.object = 'http://www.samplehospital.com.br#drugType'
      AND NOT EXISTS (SELECT * FROM statements a1
                      WHERE a0.subject = a1.subject
                        AND a1.predicate = 'http://www.samplehospital.com.br#drugType'
                        AND a1.object = 'liberated')

Translating RDF and DAML+OIL concepts in TRI-DEDALO

When an RDF or DAML+OIL document is loaded into a TRI-DEDALO knowledge base, its statements are translated into tuples of the statement relation and deduction rules. By translating a DAML+OIL ontology into TRI-DEDALO rules, it is possible to use the knowledge expressed in the ontology to answer queries using not only the facts explicitly inserted in the database, but also the derived information inferred from these facts by the deductive rules.

A UMLS RDF/DAML+OIL Ontology

The representation of the UMLS knowledge base as an RDF/DAML+OIL ontology makes it possible to capture the semantics of health concepts and the relationships among these concepts. This information is important to share knowledge and to infer new information. The complete ontology created as part of this work for the UMLS knowledge base can be found at . The UMLS ontology represents the Semantic Network concepts and relationships and the Metathesaurus relationships in DAML+OIL, allowing future applications to use a common ontology to represent knowledge.
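The idea of storing RDF statements as tuples of a three-column relation and answering a rule with SQL can be tried out with Python's built-in sqlite3 module. The sketch below is not TRI-DEDALO and does not use its JDBC driver; it only runs the ControlledDrugs query from the paper against an in-memory `statements` table. The namespace is the one that appears in that query, while the resources `drugA` and `drugB` and the function `controlled_drugs` are hypothetical.

```python
import sqlite3

NS = "http://www.samplehospital.com.br#"

# Hypothetical triples: drugA and drugB are invented resources.
TRIPLES = [
    (NS + "drugA", NS + "isa", NS + "drugType"),
    (NS + "drugB", NS + "isa", NS + "drugType"),
    (NS + "drugB", NS + "drugType", "liberated"),
]


def controlled_drugs(triples):
    """Load the triples into a statements table and run the SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE statements (subject TEXT, predicate TEXT, object TEXT)"
    )
    conn.executemany("INSERT INTO statements VALUES (?, ?, ?)", triples)
    rows = conn.execute(
        """
        SELECT DISTINCT a0.subject AS Medicine
        FROM statements a0
        WHERE a0.predicate = ? AND a0.object = ?
          AND NOT EXISTS (SELECT * FROM statements a1
                          WHERE a0.subject = a1.subject
                            AND a1.predicate = ? AND a1.object = ?)
        ORDER BY Medicine
        """,
        (NS + "isa", NS + "drugType", NS + "drugType", "liberated"),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

Negation in the rule body maps onto NOT EXISTS with a correlated subquery, so the work of excluding "liberated" drugs is done by the relational engine rather than by application code.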
The Brazilian National Health Card RDF/DAML+OIL Ontology

The Brazilian National Health Card is an ongoing project sponsored by the Ministry of Health aiming at creating an infrastructure for capturing encounter information at the point of care. The health card itself is a magnetic card with the person’s name, date of birth and a unique identifier. The system behind the health card stores clinical events and has a multi-level architecture (local, regional, state and federal). This system currently holds information on some 8 million patients, as part of the pilot project. The National Health Card System runs over a network that connects the healthcare providers in Brazil. The project defines a set of XML messages that have to be used in order to send information among the government levels.

In order to explore the knowledge base created for the National Health Card and provide additional semantics to the data handled by it, we propose a DAML+OIL ontology for the Brazilian National Health Card Project. This ontology can be used not only to answer complex queries but also to help healthcare providers with legacy systems to transform their data into the Ministry of Health format. Using the TRI-DEDALO deduction rules and the UMLS knowledge base, this can be accomplished. The complete ontology can be found at .

Results

Using a combination of ontologies and the TRI-DEDALO deductive database it is possible to share knowledge in healthcare in an efficient and flexible way. The process of information integration starts with a healthcare organization mapping its information to the UMLS and, if in Brazil, to the National Health Card ontologies. Once this mapping is available, TRI-DEDALO rules can be used to describe business rules and from them infer information such as semantically equivalent concepts represented in different forms, hierarchies of concepts and so on, achieving semantic interoperability.

To illustrate the use of the technologies and concepts presented in this paper, an example of a real-world case of knowledge sharing in healthcare is presented in this section.

A Real World Example of Knowledge Sharing

A Brazilian HMO has a rule stating that the same individual cannot undergo more than one tomography in a period of thirty days. Let us consider that two different information systems have procedure information stored in the following schemas.

Information System 1: PatientId: 456; DocumentId: 123.456; Procedure: C0040399; PatientName: David Smith; Date: 07-15-2003
Information System 2: Id: 123; Doc: 123.456; Procedure: C0032743; Name: Dave C. Smith; Date: 09-15-2003

Information System 1 records that patient David Smith, internally identified by the Id 456 and holding identification document number 123.456, underwent a SPECT procedure (code C0040399 in the UMLS Vocabulary) on 07-15-2003. Information System 2 records that patient Dave C. Smith underwent a PET scan (C0032743) on 09-15-2003.

On 09-20-2003, patient David Smith asks for authorization to schedule a SPECT at the health provider using Information System 1. If there is no knowledge sharing, the procedure will be allowed, as there is no record in Information System 1 that the patient underwent a tomography in the last 30 days. However, if the Brazilian National Health Card and the UMLS ontologies are present, a simple set of TRI-DEDALO rules can be used to describe the business rules, so that there is a mapping from the local database schema to the ontologies and the Information System can infer that:

1. David Smith and Dave C. Smith are the same person because the National Health Card Ontology states that the attribute “IdentificationDocument” is an UnambiguousProperty. As both patients have the same IdentificationDocument (123.456), they are the same person;
2. The procedures SPECT and PET are both tomographies, since in the UMLS Metathesaurus there is a child relationship between these concepts and the concept Emission Tomography, and there is a child relationship between this concept and Tomography. Since the property child is defined in the UMLS ontology as a Transitive Property, the TRI-DEDALO system infers that SPECT and PET are tomographies;
3. Since David Smith and Dave C. Smith are the same person, the TRI-DEDALO system automatically infers that all the information valid for one is valid for the other. So, the patient underwent a SPECT on 07-15-2003 and a PET on 09-15-2003;
4. Since PET and SPECT are both tomographies, the HMO constraint was violated for this patient and the procedure shall not be authorized.

Since the HMO rule is defined using RDF statements and concepts specified in the ontologies, the rule can be expressed in a schema-independent way, which allows for knowledge sharing.

Discussion

The UMLS is the most comprehensive ontology available for the healthcare domain. The relationships specified by the Semantic Network are very useful and make it possible to infer a great deal of relevant information using the mechanisms presented in this paper. Unfortunately, however, not every UMLS Semantic Network relationship is expressed in the UMLS knowledge base. In that sense, in our experiments, the Metathesaurus relationships, although not as complete as the Semantic Network ones, are more useful for data integration and knowledge sharing, since they are present for almost every concept in the knowledge base. For example, although there is the isa relationship in the Semantic Network and clearly that relationship exists between the concepts PET and Tomography, this fact is not represented in the UMLS knowledge base. The child relationship, defined by the UMLS Metathesaurus, however, is represented in the UMLS knowledge base. Although in this example child and isa would have the same meaning, the child relationship is ambiguous in the UMLS Metathesaurus, since it sometimes has the meaning of a part of relationship.
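The inference chain of the HMO example can be traced with a short Python sketch. It is a simplified stand-in for the TRI-DEDALO rules, which are not reproduced in this paper: the two records and the codes C0040399 and C0032743 come from the example, whereas the identifiers `EmissionTomography` and `Tomography`, the record layout, and the functions `is_a` and `may_authorize` are assumptions made for illustration. Identity is decided by the shared identification document (the UnambiguousProperty), and the child relationship is followed transitively.

```python
from datetime import date

TOMOGRAPHY = "Tomography"

# child relationship; the two parent identifiers are placeholders,
# not UMLS concept codes.
CHILD_OF = {
    "C0040399": "EmissionTomography",  # SPECT
    "C0032743": "EmissionTomography",  # PET
    "EmissionTomography": TOMOGRAPHY,
}

# Records from both information systems, mapped to one hypothetical layout.
RECORDS = [
    {"system": 1, "document": "123.456", "name": "David Smith",
     "procedure": "C0040399", "date": date(2003, 7, 15)},
    {"system": 2, "document": "123.456", "name": "Dave C. Smith",
     "procedure": "C0032743", "date": date(2003, 9, 15)},
]


def is_a(concept, ancestor, child_of=CHILD_OF):
    """Follow the child relationship transitively."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = child_of.get(concept)
    return False


def may_authorize(document, procedure, requested_on,
                  records=RECORDS, window_days=30):
    """Apply the HMO rule: at most one tomography per thirty days."""
    if not is_a(procedure, TOMOGRAPHY):
        return True
    for record in records:
        # Same identification document means same person.
        same_person = record["document"] == document
        days_before = (requested_on - record["date"]).days
        if (same_person and is_a(record["procedure"], TOMOGRAPHY)
                and 0 <= days_before <= window_days):
            return False
    return True
```

With both systems' records available, the request made on 09-20-2003 is refused because of the PET scan five days earlier; with only the records of Information System 1, the same request would be accepted.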
Using a combination of ontologies and the TRI-DEDALO deductive database it is possible to share knowledge in healthcare in an efficient, elegant and flexible way. Using standards such as RDF and DAML+OIL, which are accepted by a large number of organizations, a larger number of knowledge bases can be made available to share knowledge. As is well known, as the number of shareable knowledge bases increases, so does the value of the knowledge network.

The classification of heterogeneous data sources using ontologies makes it possible to establish the relationships among concepts, thereby achieving the much needed semantic interoperability. In healthcare, a highly heterogeneous, distributed and complex domain, the possibility of sharing information can greatly improve the quality of care.

The TRI-DEDALO deductive database was designed to support real-world applications, and a great deal of effort was made to make it fast and flexible enough to handle large data repositories. The fact that the TRI-DEDALO system can be used in conjunction with any SQL-compliant relational database makes it possible to add deductive features to legacy databases.

References

[1] National Library of Medicine. Unified Medical Language System. 14th ed. National Library of Medicine, 2003.
[2] Hripcsak, George. Writing Arden Syntax Medical Logic Modules. In: Computers in Biology and Medicine. 331-63. 1994.
[3] Deibel, Stephan R. A. Introduction to the InterMed Common Guideline Model and Guideline Interchange Format (GLIF). Brigham & Women’s Hospital. Harvard Medical School. Boston. November, 1996. See: http://www.glif.org/glif_overview.html
[4] GEHR. The Good Electronic Health Record. See: http://www.gehr.org.
[5] HL7. Health Level 7. See: http://www.hl7.org.
[6] McCray, Alexa. An upper-level ontology for the biomedical domain. In: Comparative and Functional Genomics. 80-84. 2003.
[7] Berners-Lee, Tim. The Semantic Web. Scientific American. May 17th, 2001.
[8] World Wide Web Consortium. RDF Vocabulary Description Language 1.0: RDF Schema. See: http://www.w3.org/TR/rdf-schema/.
[9] Fensel, D. OIL in a Nutshell. European Knowledge Acquisition Conference (EKAW-2000). Lecture Notes in Artificial Intelligence, Springer-Verlag, 2000.
[10] DARPA. DARPA Agent Markup Language (DAML). 2003. See .
[11] World Wide Web Consortium. OWL Web Ontology Language Guide - Working Draft. See .
[12] Prud’hommeaux, Eric; Grosof, Benjamin. RDF Query and Rules: A Framework and Survey. World Wide Web Consortium, 2003. See: http://www.w3.org/2001/11/13-RDFQuery-Rules/
[13] Ramakrishnan, Raghu; Ullman, Jeffrey. A Survey of Research on Deductive Database System. Journal of Logic Programming. New York. p. 125-149, May 1995.
[14] Ceri, Stefano; Gottlob, G.; Tanca, Letizia. Logic Programming and Databases. Berlin: Springer-Verlag, 1990. 284p. (Surveys in Computer Science).
[15] TRI-DEDALO. Triples, Deduction, Data and Logic. See: http://www.tridedalo.com.br
[16] Nardon, Fabiane; Moura Jr, Lincoln; Leão, Beatriz. Using RDF and Deductive Databases for Knowledge Sharing in Healthcare. In: 2nd International Semantic Web Conference (ISWC2003). Sanibel Island, FL, USA. 2003.

Address for correspondence

Fabiane Bizinella Nardon
Rua do Rocio, 313 – 4th floor
04552000 - São Paulo - SP – Brazil
[email protected]