Data Quality Enhancement of Databases Using Ontologies and Inductive Reasoning

Olivier Curé¹ and Robert Jeansoulin²

¹ S3IS, Université Paris Est, Marne-la-Vallée, France
[email protected]
² IGM, Université Paris Est, Marne-la-Vallée, France
[email protected]

Abstract. The objective of this paper is twofold: create domain ontologies by induction on source databases, and enhance data quality features in relational databases using these ontologies. The proposed method consists of the following steps: (1) transforming domain specific controlled terminologies into Semantic Web compliant Description Logics, (2) associating new axioms to concepts of these ontologies based on inductive reasoning on source databases, and (3) providing domain experts with an ontology-based tool to enhance the data quality of source databases. This last step aggregates tuples using ontology concepts and checks the characteristics of those tuples against the concepts' properties. We present a concrete example of this solution in a medical application using well-established drug related terminologies.

1 Introduction

The emergence of the Semantic Web and other semantics-dependent applications encourages the design and use of ontologies. Although efficient and user-friendly ontology engineering tools (edition, visualization, storage, etc.) are now available, designing domain ontologies from scratch is still considered a difficult and burdensome task. A generally adopted approach is that database schemata can support, and thus ease and accelerate, the design of expressive ontologies. Among these approaches, the most widely used aim to define a mapping between the source database schemata and a target ontology (see [20] for a survey of solutions on this topic). In this paper, we propose another approach, which takes advantage of the large number of databases maintained in the world as well as of the many available hierarchical classifications, thesauri and taxonomies. In the ontology research field, these notions are associated with sets of concepts that are more or less strictly organized in hierarchies. The fundamental work of [6] analyzes the meaning of taxonomic relationships and highlights that multiple types of taxonomic relationships exist. Also, as proposed in [14], depending on the context it is possible to interpret the hierarchical organizations of these terminologies as defining partial order relations on their concepts.

R. Meersman and Z. Tari et al. (Eds.): OTM 2007, Part I, LNCS 4803, pp. 1117-1134, 2007.
© Springer-Verlag Berlin Heidelberg 2007

In this paper, we concentrate


on the contexts where such properties can be assumed, and we therefore call them classifications.

The aim of our approach is twofold. In a first phase, we enrich classifications integrated in valuable databases using inductive reasoning (the output of this processing is henceforth called an ontology). In a second phase, we provide a graphical user interface which exploits the ontologies created in phase one to detect and ease the repairing of inconsistencies in the source databases. A main idea of our approach is to consider that the axioms added to the ontologies correspond to additional data dependencies of the source database. These data dependencies are strongly related to the instances of the database, and their representation is generally not supported in standard relational database management systems (RDBMS). Our approach thus enables storing these data dependencies as valuable properties of expressive ontologies. Starting from the resulting ontologies, we check whether the source database violates some of these data dependencies and propose a data cleaning solution. Due to possible exceptions in the dataset, the automatic repairing of data may not be pertinent. Nevertheless, we assist the end-user by automatically detecting possible violations and by offering an effective and user-friendly graphical user interface to semi-automatically enable data repairing.

By repairing violations, we aim to enhance the data quality of the source databases. In this paper, we are influenced by the ISO 9000 quality standards and consider two data quality elements: completeness and correctness. In terms of completeness, we are interested in the presence or absence of data in the dataset. We distinguish between two aspects of completeness: (i) commission, i.e. when excess data are present in the dataset, and (ii) omission, i.e. when some data are absent from the dataset. Considering the correctness aspect, we expect the dataset to contain the correct data.
This paper is organized as follows: in Section 2, we present the basic notions involved in our data quality enhancement framework. In Section 3, we highlight the motivating example of this research, which is related to medical informatics and exploits drug databases. In Section 4, the terminology enrichment using inductive reasoning is presented. Section 5 introduces our ontology-based detection and repairing solution and evaluates its efficiency on the drug database example. Section 6 emphasizes the adaptability of our solution to other domains, such as geographical information. Related work is discussed in Section 7, and Section 8 concludes the paper and outlines future work.

2 Basic Notions

This section reviews the main notions, related to relational databases and Description Logics (DL), needed to present our framework. A fixed database schema is a finite sequence R = {R1, .., Rn} of relation symbols Ri, with 1 ≤ i ≤ n, that are of fixed arity and correspond to the database relations. We define a relation Ri as a set of attributes {A_i^1, .., A_i^k}, where the A_i^j, with 1 ≤ j ≤ k, denote attributes over a possibly infinite domain D.


An instance I over a schema R is a sequence (R_1^I, .., R_n^I) that associates each relation symbol Ri with a relation R_i^I of the same arity. In this paper, we often abuse notation and use Ri to denote both the relation symbol and the relation R_i^I that interprets it. We call a fact R(t) the association between a tuple t of a fixed arity and a relation R of the same arity. If R is a schema, then a dependency over R is a sentence in some logical formalism over R.

The ontologies are represented using a DL formalism [2]. This family of knowledge representation formalisms allows representing and reasoning over domain knowledge in a formal and well-understood way. We assume readers are familiar with the semantics of DLs, though we recall that the syntax for concepts in SHOIN(D) [15] is defined as follows, where Ci is a concept, A is an atomic concept, R is an object role, S is a simple object role, T is a datatype role, D is a datatype, oi is an individual and n is a non-negative integer:

C → A | ¬C1 | C1 ⊓ C2 | C1 ⊔ C2 | ∃R.C | ∀R.C | ≥ n S | ≤ n S | {o1, .., om} | ≥ n T | ≤ n T | ∃T.D | ∀T.D

The reason for our interest in the SHOIN(D) DL is its syntactic equivalence with the OWL DL language [11], an expressive ontology language developed by the W3C which is already supported by numerous tools (editors, reasoners, etc.).
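As a small illustration of these constructors (our own example, with hypothetical concept and role names, not taken from the paper), one can write:

```latex
% Products that carry the pregnancy contra-indication, whose substances
% are all opium alkaloids, and that have at most one rating:
Drug \sqcap \exists hasContraIndication.\{ci_{21}\}
     \sqcap \forall hasSubstance.OpiumAlkaloid \sqcap\; \leq 1\, hasRating
```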

3 Motivating Example

The motivating example of our approach is related to data quality assessment in a self-medication application [9] and a drug database. In a nutshell, our self-medication application enables patients to maintain medical information in a personal health care record and to access data on mild clinical signs and over-the-counter drugs. In such a context, the data quality of the drug database is fundamental, as incorrect or incomplete data, e.g. contra-indications, may exacerbate the health condition of the patient. The drug database stores, for all drugs sold on the French market, all the information available in the Summary of Product Characteristics (SPC) as well as some extra information, such as a rating based on an efficiency/tolerance ratio and a comment from a team of health care professionals. Some of the most widely used classifications in the drug domain have also been integrated in our database and are used to link drug products to concepts in these classifications. In the rest of this section, we introduce the main characteristics of these drug classifications.

3.1 Drug Terminologies

Controlled terminologies and classifications are widely available for health care and bioinformatics [7]. In this paper, we are interested in the drug related classifications that are the most used in French drug databases, namely the European Pharmaceutical Market Research Association (EphMRA) and the Anatomical Therapeutic Chemical (ATC) classifications. Like most French drug databases, we use the French CIP codes to identify products. For instance, the drug Tussic syrup is sold as a 250ml bottle product identified with CIP value 3622682


and a 125ml bottle product with CIP value 3622676. So each product has a distinct CIP code, and several CIPs may be available for a given drug, one for each product presentation.

EphMRA classification. The EphMRA brings together European, research-based pharmaceutical companies operating from a global perspective. One of the missions of the EphMRA is to provide recognised standards by continuously supporting and actively participating in establishing high levels of standards and quality control in pharmaceutical marketing research. The Anatomical Classification system (AC-system) is the main classification developed by the EphMRA, together with its sister organisation in the USA, the Pharmaceutical Business Intelligence and Research Group (PBIRG). This system represents a subjective method of grouping certain pharmaceutical products. The products are classified according to their main therapeutic indication, and each product is assigned to exactly one category.

In the AC-system, categories are organized in a cascade of 4 levels, where each sub-level gives additional details about its upper level. The first level of the code is based on a letter for the anatomical group and defines 14 groups. The second level is used to regroup several classes together, in order to classify according to (i) indication, (ii) therapeutic substance group and (iii) anatomical system. This level adds a digit to the letter of the first level and enables the creation of the cascade classification. Therefore, before creating a new second level, all existing classification possibilities should be analyzed. There can be cases where it is necessary to create a second level without a cascade to the third or fourth level, but such cases are rare in the current classification. The third level adds a letter to a second level code and describes a specific group of products within the second level. This specification can be a chemical structure, or it can describe an indication or a method of action.
The fourth level gives more details about the elements of the third level (formulation, chemical description, mode of action, etc.). Fourth level codes add a digit to third level ones. The complete hierarchy for antitussive drugs corresponds to:

R: Respiratory system
R5: Cough and cold preparations
R5D: Antitussives
R5D1: Plain antitussives
R5D2: Antitussives in combinations

ATC classification. The ATC system [22] proposes an international classification of drugs and is part of WHO's initiatives to achieve universal access to needed drugs and their rational use. In this classification, drugs are classified in groups at five different levels. In fact, the ATC system modifies and extends the AC-system of EphMRA. Thus the first level is composed of the 14 groups of the EphMRA system. The second level is also quite similar and corresponds to a pharmacological/therapeutic subgroup. The third and fourth levels are chemical/pharmacological/therapeutic


subgroups. Finally, the fifth level corresponds to the chemical substances. With its fifth level, the ATC classification makes it possible to classify drugs according to Recommended International Non-proprietary Names (rINN). This is different from EphMRA's classification, where the leaves of the tree (fourth level) give details on a wider perspective (formulation, chemical description, mode of action, etc.). Thus we consider that the use of both terminologies in our data quality assessment approach is complementary. We now provide an extract from the ATC hierarchy for some cough suppressants:

R: Respiratory system
R5: Cough and cold preparations
R05D: Cough suppressants, excluding combinations with expectorants
R05DA: Opium alkaloids and derivatives
R05DA01: Ethylmorphine
...
R05DA08: Pholcodine
...
R05DA20: Combinations

The R05DA20 code identifies compound chemical products that combine opium alkaloids with other substances. An example for this code is the Hexapneuc syrup, which contains the following chemical substances: pholcodine, chlorphenamine and biclotymol, which respectively correspond to R05DA08, R06AB04 and R02AA20. In our approach, we argue for the relevance of classifying such products with the conjunction of their codes rather than with a single unifying code.
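The classification-by-conjunction idea can be sketched as follows; the substance-to-code values reproduce the paper's example, while the helper and dictionary names are our own:

```python
# Represent a combination product by the set (conjunction) of the ATC
# codes of its substances, rather than by a single "combinations" code
# such as R05DA20.
product_substances = {
    "hexapneuc_syrup": ["pholcodine", "chlorphenamine", "biclotymol"],
}
atc_code = {
    "pholcodine": "R05DA08",
    "chlorphenamine": "R06AB04",
    "biclotymol": "R02AA20",
}

def classify(product):
    """ATC classification of a product as a conjunction of codes."""
    return {atc_code[s] for s in product_substances[product]}

print(sorted(classify("hexapneuc_syrup")))
# ['R02AA20', 'R05DA08', 'R06AB04']
```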

4 Terminology Enrichment Using Inductive Reasoning

The purpose of the terminology enrichment is to enable the aggregation of database tuples in a coherent way, such that common properties can be discovered and associated to ontology concepts. Intuitively, we consider that a relation, or a set of relations, Term in the database schema R stores a given terminology, e.g. the ATC classification. Let us consider that a relation Ind stores individuals of the database domain, e.g. drug products. Then it is most likely that a one-to-many relation, or a chain of relations, TermInd relates facts between Term and Ind. We can also assume that properties, e.g. contra-indications or side-effects, about these individuals are either directly stored in Ind or stored in a relation Prop, in which case a possibly many-to-many relation PropInd relates facts between Prop and Ind. Thus the Ind relation plays a central role in our solution, as it enables joining elements of Term to elements of Prop. Given a fact in Term, it is possible to aggregate tuples from Ind, via TermInd, in a sufficiently coherent manner and to extract valuable properties from these groups. Such aggregations can be performed with plain SQL queries in most RDBMS.
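As a sketch of this aggregation (with illustrative table and column names, not the paper's actual schema), the TermInd/PropInd join can be written as a single GROUP BY query:

```python
import sqlite3

# Toy instance of the TermInd / PropInd link relations described above
# (table and column names are illustrative; Term, Ind and Prop
# themselves are omitted for brevity).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE TermInd (cip TEXT, code TEXT);   -- product -> AC/ATC code
CREATE TABLE PropInd (cip TEXT, pid INTEGER); -- product -> property id
""")
db.executemany("INSERT INTO TermInd VALUES (?,?)",
               [("p1", "R05D1"), ("p2", "R05D1"), ("p3", "R05D2")])
db.executemany("INSERT INTO PropInd VALUES (?,?)",
               [("p1", 9), ("p2", 9), ("p3", 21)])

# Aggregate the products classified under a terminology code and count
# how often each property occurs within the group.
rows = db.execute("""
    SELECT PropInd.pid, COUNT(*)
    FROM TermInd JOIN PropInd ON TermInd.cip = PropInd.cip
    WHERE TermInd.code LIKE 'R05D1%'
    GROUP BY PropInd.pid
""").fetchall()
print(rows)  # [(9, 2)] : both R05D1 products carry property 9
```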

4.1 Transformation into a Description Logic Formalism

In order to perform such an enrichment, it is first necessary to transform the terminologies stored in Term into a DL formalism. This step has been performed using the DBOM [8,10] Protégé plugin [13]. In a nutshell, DBOM (DataBase Ontology Mapping) enables designing and enriching existing OWL ontologies from relational databases by mapping relations to concepts and roles of the TBox via SQL queries. DBOM proposes a solution to the impedance mismatch problem between relational schemata and DL-based knowledge bases (KB) and supports the creation of simple and complex matchings with declarative mappings. DBOM also enables the creation of ABoxes by processing the SQL queries associated to each mapped concept and role. Finally, this system provides a preference-based approach to deal with inconsistencies.

For the terminology transformation operations, DBOM's input is the database schema R. The end-user defines mappings between relations of R and concepts of the ontology. The output is an OWL DL ontology [11]. In the context of our motivating example, the classification codes stored in Term, e.g. EphMRA or ATC codes, are transformed into OWL concepts. The classification levels are represented as a subsumption of concepts, i.e. using rdfs:subClassOf properties, and sibling concepts are declared disjoint, i.e. using owl:disjointWith properties.

In the following extract of our ATC ontology, we provide the description associated with the Pholcodine chemical substance concept, whose ATC code is R05DA08. On line 1, we can see that this concept is identified by a given URI with a local name corresponding to the R05DA08 ATC code. Line 2 defines the concept associated with the R05DA code (Opium alkaloids and derivatives) to be a super concept of this concept. On lines 3 and 4, we present examples of disjointWith properties between sibling concepts; only the first and last siblings are displayed for brevity. Line 5 states a comment in the French language.

1. <owl:Class rdf:ID="R05DA08">
2.   <rdfs:subClassOf rdf:resource="#R05DA"/>
3.   <owl:disjointWith rdf:resource="#R05DA01"/>
...
4.   <owl:disjointWith rdf:resource="#R05DA20"/>
5.   <rdfs:comment xml:lang="fr">Pholcodine</rdfs:comment>
6. </owl:Class>
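The transformation of a stored classification into subsumption and disjointness axioms can be sketched as follows; this is our own illustration of the principle, not DBOM's actual API, and the toy parent rule (drop the last character of a code) simplifies the real multi-character AC/ATC levels:

```python
from itertools import combinations

def terminology_axioms(codes, parent_of):
    """Each code becomes a concept subsumed by its parent code, and
    sibling codes are declared pairwise disjoint."""
    axioms = []
    children = {}
    for code in codes:
        p = parent_of(code)
        if p is not None:
            axioms.append((code, "rdfs:subClassOf", p))
            children.setdefault(p, []).append(code)
    for sibs in children.values():
        for a, b in combinations(sorted(sibs), 2):
            axioms.append((a, "owl:disjointWith", b))
    return axioms

# Toy parent rule: the parent code is the code minus its last character.
parent = lambda c: c[:-1] or None
axioms = terminology_axioms(["R", "R5", "R5D", "R5D1", "R5D2"], parent)
print(axioms)
```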

4.2 Enriching the Description Logic Via Induction

Starting from such an OWL ontology, which we now call O, it is possible to perform an enrichment by inductive reasoning on the information stored in the source database. Intuitively, the presented method exploits the one-to-many and many-to-many relations holding between the relations Ind, Term and Prop, where tuples from Ind are first-class citizens in the induction approach. Given the relations {Term, Ind, TermInd, Prop, PropInd} of a database instance R and the top concept (⊤) of O, the method proceeds as follows. In a first


step, it is necessary to select the relation Prop that leads the inductive process, to create an OWL concept associated to this relation, e.g. ContraIndication, and an object property, π, that relates concepts of O to this newly created concept, e.g. hasContraIndication. We can now present the Induction-Based Ontology Enrichment (IBOE) algorithm, in which KB denotes the knowledge base being enriched and θ a threshold previously defined in the system. We consider that a binary relation PropInd (respectively TermInd) follows the pattern {A1, A2}, where these attributes correspond respectively to the primary keys of Ind and Prop (respectively Ind and Term for TermInd). The input concept of IBOE is the concept from which we want to start building the domain ontology; the process generally starts with the ⊤ concept.

Algorithm 1. Induction-based ontology enrichment (IBOE) algorithm
Input: concept c of the ontology
Output: enriched ontology O
1.  for each sub-concept sc of c do
2.    size := number of individuals in Ind that are associated, via TermInd, to tuples of Term related to sc
3.    if size > 0 then
4.      for each {p, count} retrieved from the query
          SELECT PropInd.A2, count(*) FROM TermInd, PropInd
          WHERE TermInd.A2 ILIKE 'sc%' AND TermInd.A1 = PropInd.A1
          GROUP BY PropInd.A2 HAVING count(*)/size ≥ θ
        do
5.        if no individual corresponding to p is in the KB then
6.          create individual(p)
7.          associate π owl:hasValue individual(p) for the concept sc
8.        else
9.          if none of the superclasses of sc already has the property π then
10.           retrieve individual(p) from KB
11.           associate π owl:hasValue individual(p) for the concept sc
12.   IBOE(sc)
13. end do
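A minimal executable sketch of Algorithm 1, using an in-memory SQLite database; the table, column and helper names are ours, not the paper's exact schema:

```python
import sqlite3

THETA = 0.6  # predefined confidence threshold

def iboe(db, ontology, concept, children_of, inherited=frozenset()):
    """Walk the concept hierarchy; attach a property value to a concept
    when the ratio of its products carrying that value reaches THETA
    and no superclass already holds it (Algorithm 1, lines 1-12)."""
    attached = ontology.setdefault(concept, set())
    inh = set(inherited) | attached
    for sc in children_of(concept):
        size = db.execute(
            "SELECT COUNT(DISTINCT cip) FROM TermInd WHERE code LIKE ?",
            (sc + "%",)).fetchone()[0]
        if size > 0:
            rows = db.execute(
                """SELECT PropInd.pid FROM TermInd
                   JOIN PropInd ON TermInd.cip = PropInd.cip
                   WHERE TermInd.code LIKE ?
                   GROUP BY PropInd.pid
                   HAVING COUNT(*) * 1.0 / ? >= ?""",
                (sc + "%", size, THETA)).fetchall()
            for (p,) in rows:
                if p not in inh:  # line 9: no superclass holds pi yet
                    ontology.setdefault(sc, set()).add(p)
        iboe(db, ontology, sc, children_of, inh)
    return ontology

db = sqlite3.connect(":memory:")
db.executescript("""CREATE TABLE TermInd (cip TEXT, code TEXT);
                    CREATE TABLE PropInd (cip TEXT, pid INTEGER);""")
db.executemany("INSERT INTO TermInd VALUES (?,?)",
               [("p1", "R05D1"), ("p2", "R05D1"), ("p3", "R05D2")])
db.executemany("INSERT INTO PropInd VALUES (?,?)",
               [("p1", 76), ("p2", 76), ("p3", 76), ("p3", 21)])
tree = {"": ["R05D"], "R05D": ["R05D1", "R05D2"]}
onto = iboe(db, {}, "", lambda c: tree.get(c, []))
print(onto)  # 76 lands on R05D, 21 on R05D2 only
```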


Fig. 1. Extract of the XIMSA drug database

In the following, we present the inductive reasoning method on the EphMRA ontology, also named AC-ontology, and stress that an adaptation for the ATC ontology is straightforward. The method used to enrich the AC-ontology is based on inductive reasoning over relevant groups of products, generated using the AC hierarchy. Intuitively, we navigate the hierarchy of AC concepts and create groups of products for each level using the ProductEphMRA relation, i.e. TermInd in the IBOE algorithm. Then, for each group, we study the information contained in the ContraIndication relation, i.e. Prop in our database R, and for each possible value in this domain we calculate the ratio of the occurrences of this value over the total number of elements of the group.

Table 1 proposes an extract of the results for the concepts of the respiratory system and the contra-indication domain. This table highlights that our self-medication database contains 56 antitussives (identified by AC code R05D), which are divided into 44 plain antitussive products (R05D1) and 12 antitussives in combinations (R05D2). For the contra-indication identified by the number 76, i.e. allergy to one of the constituents of the product, we can see that a ratio of 1 has been calculated for the group composed of the R AC code. This means that all 152 products (100%) of this group present this contra-indication. We can also stress that, for this same group, the breast-feeding contra-indication (#9) has a ratio of 0.48, meaning that only 72 products out of the 152 of this group present this contra-indication.
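The ratio computation underlying Table 1 can be sketched without a database; the product data below is invented for illustration, and the helper name is ours:

```python
# Each product (CIP) carries an AC code and a set of contra-indication
# ids; the values are invented for illustration.
products = {
    "p1": ("R05D1", {76, 9}),
    "p2": ("R05D1", {76}),
    "p3": ("R05D2", {76, 9, 21}),
    "p4": ("R05D2", {76, 21}),
}

def ratio(code, ci):
    """Fraction of the products grouped under `code` (via the AC-code
    prefix) that present contra-indication `ci`."""
    group = [cis for c, cis in products.values() if c.startswith(code)]
    return sum(ci in cis for cis in group) / len(group)

print(ratio("R05D", 76))   # 1.0 : all four products
print(ratio("R05D", 21))   # 0.5 : two out of four
print(ratio("R05D2", 21))  # 1.0
```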


Table 1. Analysis of contra-indications for the respiratory system

ContraId      R      R05    R05D   R05D1  R05D2
(# products)  152    71     56     44     12
9             .48    .83    .86    .82    1
21            .26    .39    .3     .2     .73
76            1      1      1      1      1
108           .34    .69    .84    .84    .82
109           .35    .66    .8     .8     .82
110           .34    .73    .89    .86    1
112           .34    .71    .88    .86    .91
129           .4     .56    .8     .82    0.16

We now consider this ratio as a confidence value for a given AC-concept on the membership of a given domain value. This membership is materialized in the ontology by associating an AC-concept with a property, e.g. the hasContraIndication property, that has the value of the given contra-indication, e.g. breast-feeding (#9). In our approach, we only materialize memberships when the confidence values are greater than or equal to a predefined threshold θ; in the contra-indication example this value is set to 0.6. The membership is only attached to the highest concept in the AC hierarchy and inherited by its sub-concepts. For instance, the breast-feeding contra-indication (#9) is associated to the R05 AC-concept, as its confidence value (0.83) is the first one on the ContraId 9 line of the R hierarchy that is greater than or equal to θ (0.6). Also, the pregnancy contra-indication (#21) is attached to the R05D2 AC-concept since its confidence value is 0.73.

Using this simple approach, we are able to enrich the AC-ontology with axioms related to several features of the SPC, e.g. contra-indications, side-effects, etc. At the end of this enrichment phase, the expressiveness of the newly generated ontology still corresponds to an OWL DL ontology. The following code proposes an extract of the AC-ontology, in RDF/XML syntax, where we can see the definition of the R05D2 concept (lines #1 to #12). This description states that the concept:

– has the contra-indication identified by CI_21 (lines #2 to #7), which corresponds to pregnancy (lines #13 to #16);
– is a subconcept of the R05D concept (line #8);
– is disjointWith the concept identified by the R05D1 code (line #9);
– has a comment, expressed in the French language (line #10).

1.  <owl:Class rdf:ID="R05D2">
2.    <rdfs:subClassOf>
3.      <owl:Restriction>
4.        <owl:onProperty rdf:resource="#hasContraIndication"/>
5.        <owl:hasValue rdf:resource="#CI_21"/>
6.      </owl:Restriction>
7.    </rdfs:subClassOf>
8.    <rdfs:subClassOf rdf:resource="#R05D"/>
9.    <owl:disjointWith rdf:resource="#R05D1"/>
10.   <rdfs:comment xml:lang="fr">ANTITUSSIFS EN ASSOCIATION</rdfs:comment>
11.
12. </owl:Class>
13. <ContraIndication rdf:ID="CI_21">
14.   <rdfs:comment xml:lang="fr">grossesse</rdfs:comment>
15.
16. </ContraIndication>

4.3 Ontology Refinement

In some situations, it may be necessary to revise the placement of properties in the concept hierarchy. Table 2 highlights such a case, if we consider that x2 ≥ x1 ≥ θ but x3 ≤ θ, and that the concepts N2 and N3 are siblings and sub-concepts of N1. The execution of the IBOE algorithm attaches the property px to the N1 concept, as it is the first concept in the hierarchy that has a value, i.e. x1, greater than θ. Table 2 emphasizes that the property is disbelieved for instances of the concept N3 but still holds for the concept N2. Thus we consider it necessary to refine the attachment of px to the concept hierarchy. Figure 2 presents the refinement policy: the ontology changes from a state where px is associated to N1 (Figure 2a) to a state where px is attached to all sub-concepts of N1 with a confidence value greater than or equal to θ (Figure 2b), in this case only the N2 node.

The line identified by contra-indication #129 in Table 1 highlights the need to refine the ontology. The resulting ontology has contra-indication #129 attached to R05D1 and not to R05D, as originally deduced by the IBOE algorithm.

Fig. 2. Ontology refinement


This method can easily be applied to the ATC ontology or to other drug related ontologies, provided that the ontology is represented in a DL formalism and that a relation relates CIPs to identifiers of this ontology.
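The refinement policy of Figure 2 can be sketched as follows (function and variable names are ours); the demonstration reproduces the contra-indication #129 case from Table 1:

```python
THETA = 0.6

def refine(concept, confidence, children_of, ontology, prop):
    # If some sub-concept disbelieves a property attached to `concept`
    # (confidence below theta), detach it from `concept` and re-attach
    # it only to the sub-concepts whose confidence reaches theta.
    subs = children_of(concept)
    if prop in ontology.get(concept, set()) and subs:
        if any(confidence[(sc, prop)] < THETA for sc in subs):
            ontology[concept].discard(prop)
            for sc in subs:
                if confidence[(sc, prop)] >= THETA:
                    ontology.setdefault(sc, set()).add(prop)
    return ontology

# Contra-indication #129 of Table 1: kept by R05D1 (.82) but
# disbelieved by R05D2 (.16), so it moves from R05D down to R05D1.
conf = {("R05D1", 129): .82, ("R05D2", 129): .16}
onto = {"R05D": {129}}
refine("R05D", conf, lambda c: ["R05D1", "R05D2"] if c == "R05D" else [],
       onto, 129)
print(onto)  # {'R05D': set(), 'R05D1': {129}}
```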

5 Ontology-Based Detection and Repairing

5.1 Detection Method

In this section, we only consider data quality violations at the completeness (commission and omission) and correctness levels. The principles we use to detect these violations are supported by the ontologies defined in Section 4 and the relational database R, e.g. the drug database extract of Figure 1. The main assumption of this method is the following: we consider that the database R presents overall good data quality. This is the reason why we designed the ontology enrichment by induction on this database. But as data-related dependencies are not supported in R, some data violations may exist in R. Thus we use the properties associated with the concepts of our ontology(ies) to detect data quality violations. The potential of this approach is interesting because tuples from R can generally be aggregated using different characteristics, e.g. therapeutic class or chemical substances for the drug database. We also believe that an efficient approach to designing relevant groups is based on the terminologies that supported the design of these ontologies, e.g. the EphMRA and ATC terminologies.

We can view the relation between the ontologies and the relational database from a logical point of view. The schema part of a DL KB is typically called a TBox (terminology box) and is a finite set of universally quantified implications [2]. Most DLs can be considered as decidable fragments of first-order logic; thus their axioms have an equivalent representation as first-order formulae [5]. On the other hand, the schema of a relational database is defined in terms of relations and dependencies [16], also named integrity constraints. In [1], the authors explain that most dependencies can be represented as first-order formulae and have a dual role in relational databases: they describe the possible worlds as well as the states of the databases.
Reiter also observed in [19] that integrity constraints are sentences about the state of the database and not objective sentences about the world. As discussed in [17], although the expressivity of the DLs underlying OWL and of relational dependencies is clearly different, the schema languages of the two are quite closely related. The interpretation of schemas in both DLs and relational databases is grounded in standard first-order semantics. In this semantics, a distinction is made between legal and illegal relational structures. A structure is legal when it satisfies all axioms defined in its schema; otherwise it is illegal. The terms used to denote legal structures in DLs and relational databases differ: models and database instances, respectively. Whenever a relational database is updated, its dependencies are interpreted as checks. If a check is satisfied, the database instance is modified accordingly
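As an illustration of this correspondence, consider a dependency stating that every product classified under R05D2 must record the pregnancy contra-indication; the formulas and role names below are ours, built on the paper's vocabulary:

```latex
% First-order form of the dependency:
\forall x \, \big( R05D2(x) \rightarrow hasContraIndication(x, ci_{21}) \big)
% Corresponding DL (TBox) axiom:
R05D2 \sqsubseteq \exists hasContraIndication.\{ci_{21}\}
```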


otherwise the update is rejected. The behavior of DL first-order formulae on models is different [12], [17]. In our approach, we consider that each database instance respects a given set of integrity constraints. But we also require these database instances to respect a set of dependencies expressed in some associated ontologies, e.g. the EphMRA and ATC ontologies in our medical example. Thus we want the database instances to satisfy explicitly provided database schema integrity constraints as well as the DL dependencies. The mechanism proposed for the latter has to deal with domains where many exceptions can occur, e.g. the pharmacology domain. For example, when processing a group of drugs based on a given EphMRA concept, we may encounter drugs which do not present the same set of contra-indications. This may be caused by the dosage of the drugs, an aspect we are currently trying to solve with the integration of the DDD system, or by the presence of excipients, an issue we aim to address with the integration of rules in the ontologies.

In the current solution, in order to deal with these exceptions, we must involve users in the process of repairing violations. Thus our detect-and-repair approach is semi-automatic: detection is automatic, while repairing involves the validation of the end-user. These steps are facilitated by the design of a web interface which highlights possible violations for a set of drugs and proposes a fast and easy way to correct them. We currently propose two graphical solutions to repair these violations: an ontology concept-centric and a database attribute-centric approach.

Ontology concept-centric approach. In this approach, two concepts from the ontology O are first selected. The first concept C1 corresponds to a code of the Term relation, while the second one, C2, corresponds to one of the concepts created from the relations Prop, e.g. the concept ContraIndication.
The validation of this selection causes the display of a matrix where columns correspond to instances of the C2 concept and rows to database tuples, i.e. drug products, resulting from the execution of a system-generated SQL query and identified by a value x. At a row x and a column y, the matrix can be filled with 3 different values: (1) an 'x' symbol indicates that the tuple of Ind identified by value x has the property identified by value y for the C2 concept; (2) a '?' highlights that the tuple does not have this property although, according to the data in the ontology, it should; (3) an empty cell indicates that the drug does not have this property and that this state is consistent with the ontology's knowledge. In case (2), the end-user can click on the question mark to automatically correct the database by inserting a new tuple in PropInd stating that the database individual has this property.

Figure 3 proposes an extract of this presentation for the contra-indication fields of the R05D2 EphMRA code. From such a matrix, it is possible to click on a CIP number and to access the complete SPC of the drug. In this case, the composition field may help the health care professional to take a decision. Concerning possible database violations highlighted by question marks in Figure 3, two violations are detected for the contra-indication #108 (productive cough): products identified


Fig. 3. Extract from an ontology concept-centric view: contra-indication matrix for the R05D2 EphMRA concept

by CIPs 3032035 and 3418154. This detection is processed assuming that the drugs of category R05D2 may all have the productive cough contra-indication. This aspect corrects data quality completeness related to omission. It is also possible to correct data quality completeness related to commission by deleting contra-indications in the SPC drug window.

Database attribute-centric approach. In this approach, an attribute of the database R, possibly not associated to an ontology concept, a concept created from a relation Prop, and a set of available ontologies are selected. The selected attribute serves to design groups of database tuples, and the set of ontologies enables the analysis of these groups according to the information stored in the ontologies. The results are displayed in a matrix similar to the one presented previously: columns are individuals of the concept C2 and rows are tuples of the created group. The cells of the matrix can again be empty or filled with 'x', with the same interpretation as in the previous approach. But the cells can also be filled with integer values that range from 1 to 2^n - 1, with n the number of ontologies in the set. These values identify the elements of the powerset of the n ontologies minus the empty set, which is dealt with by the 'x' symbol and the empty cells. Figure 4 proposes an extract for the contra-indication SPC field and the antitussive therapeutic class. Both the EphMRA and ATC ontologies have been selected, thus values range from 1 to 3:

– A value of 1 in a cell highlights a proposition made from the EphMRA ontology. This is the case for the contra-indication with value 109 and the products identified with CIPs 3481537 and 3371903.
– A value of 2 in a cell highlights a detection made from inferences using the ATC ontology.
– A value of 3 in a cell highlights that both ontologies (EphMRA and ATC) have detected this cell as a candidate for violation.
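The cell coding used by both matrix views can be sketched with a single helper (our own function, not the paper's implementation); for the concept-centric view there is a single ontology, so the only possible integer value, 1, is rendered as the '?' mark:

```python
# For each product/property pair, compare what the database records
# with what the enriched ontologies predict. Cell codes: 'x' if the
# property is recorded, '' if it is neither recorded nor predicted,
# and an integer in 1 .. 2^n - 1 identifying which of the n ontologies
# flag the missing property.
def matrix_cell(recorded, predicted_by):
    """`predicted_by`: one boolean per ontology (e.g. EphMRA, ATC)."""
    if recorded:
        return "x"
    code = sum(2 ** i for i, hit in enumerate(predicted_by) if hit)
    return code if code else ""

print(matrix_cell(True,  [True, True]))    # 'x'
print(matrix_cell(False, [True, False]))   # 1  (EphMRA only)
print(matrix_cell(False, [False, True]))   # 2  (ATC only)
print(matrix_cell(False, [True, True]))    # 3  (both)
print(matrix_cell(False, [False, False]))  # ''
```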

1130

O. Curé and R. Jeansoulin

Fig. 4. Extract from a database attribute-centric view: contra-indication matrix for antitussive therapeutic class

5.2 Evaluation

We propose to evaluate this method on two aspects: (1) the improvement of the resulting drug database after a thorough detection and repairing step, and (2) the satisfaction of the team of health care professionals maintaining the database. The evaluation emphasizes the results of the first execution of the detection and repairing process. These are the most relevant results, as this is the step where the enhancements are most clearly visible. As database updates are performed, the data quality improvements become less evident but remain just as effective.

The first evaluation aspect is to study the two most prominent comparison criteria originating from information retrieval: precision and recall. In logical terms, precision and recall correspond respectively to correctness and completeness. Our evaluation studied the following SPC fields: contra-indications, drug interactions, cautions with allergies and diseases, cautions with other drug treatments, and side-effects. Our method proceeds as follows: during a detection and repairing session we evaluate the precision and recall for each row of our matrices. In such a setting, precision is the measure of correctly found properties, marked 'x' in a matrix (true positives), over the total number of properties, i.e. the number of columns in the matrix (true and false positives). Recall measures the ratio of true positives over the total number of properties for a row (true positives and false negatives). Testing over a set of more than 2000 drugs, we evaluated the average precision and recall to be 0.71 and 0.61 respectively. Even more important is the estimation of the improvements over both criteria after our induction-based repairing. These improvements are calculated after the validation of a matrix by a domain expert, possibly with some repairs, e.g. the end-user modifies the set of contra-indications for a given drug, thus changing the number of true positives and/or false positives.
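The row-wise precision just described can be sketched as follows (an illustrative fragment under our own naming; note that this is the paper's own definition, where the denominator is the full set of proposed columns, not the standard information-retrieval formulation):

```python
# Row-wise precision as defined in the text: the share of cells confirmed
# by the domain expert ('x', true positives) over the total number of
# properties, i.e. the number of columns in the matrix row.

def row_precision(row):
    """Precision of one matrix row: confirmed 'x' cells over all columns."""
    return sum(1 for cell in row if cell == "x") / len(row)

def matrix_precision(matrix):
    """Average row precision over a whole detection matrix."""
    return sum(row_precision(row) for row in matrix) / len(matrix)
```

For example, a row `["x", "", "x", ""]` yields a precision of 0.5.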
Over the same set of drugs, the improvement in precision was 6% and in recall 8%. This rate of improvement shows that the original database was of relatively high quality, which is good news from the inductive reasoning point of view. But it also shows that this database's data quality can still be improved by our method. After each execution of the ontology-based detection and repairing, the ontology is refined through modification/validation by the domain experts responsible for this task. Finally, another interesting feature is the system's ability to check the data dependencies stored in the ontologies after each database update.

The second interesting aspect is the satisfaction of our health care professionals. They consider the detection and repairing method really user-friendly, as it only involves reading and clicking, and the learning curve is quite short, as the detection only requires a visual inspection. The best proof of success is the high usage rate of this maintenance assistant by our team of domain experts. We believe that we can improve the end-user efficiency for the completeness commission factor, which is less obvious and really requires a clear understanding of the grouping characteristic. The efficiency of the detection method depends on the quality of the grouping factor. We believe that the system can assist end-users in selecting pertinent groups where the number of individuals is relevant. We stress that the fewer products there are in a group, the easier it is to reach the membership threshold θ without the result really being relevant.

6 Application to Other Domains

There are other situations where the problem can be formulated in the same framework as the one used in this paper. This framework can be sketched this way:
– there exist some identifiable and observable items: here, an item is an individual "drug";
– there exist some properties, in some identifiable domain, that can be attached to the above-mentioned observable items: here, the effects of each drug can be observed with respect to some contra-indication list, through series of more or less quantifiable observations;
– there exists a general taxonomy of individual components, whose lower level is made of individuals, which we can name atoms: here, an atom is a "chemical molecule";
– finally, there exists a partonomy of items in terms of atoms: here, this is the chemical composition of each drug as an association of molecules.

The enrichment procedure tries to propagate the appropriate properties from the atoms up to the top of the hierarchy, indicating which contra-indications are most likely. Let us consider the situation of landscape description and analysis:
– items are individual polygonal zones that we can observe, for instance from space;
– atoms are land parcels, stored as administrative or agricultural parcels, within some geographical database;
– the taxonomy of parcels is the hierarchy of space partitions at several levels (county, state, etc.) or any landscape sub-division;


– the partonomy describes the observed polygonal zones in terms of unions of parcels (or intersecting parts);
– the observable properties can be crop classes, vegetation state, etc.

The enrichment procedure tries to propagate the appropriate properties from the atoms up to the top of the hierarchy, describing the prominent land-use at any level in terms of a set of vegetation properties, etc. We can also introduce a second taxonomy on the observable properties, and propagate, at each generalisation level, the most contributing part of space in terms of a set of space atoms. Finally, the two enrichments, on the space taxonomy and on the property taxonomy, can be performed in a coordinated movement and used for controlling the overall consistency. Even if this is probably a computational challenge, it is one perspective for future investigation.
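The enrichment procedure common to both domains can be sketched as follows. This is a hedged illustration under our own assumptions: the taxonomy is a child dictionary, the names are ours, and the threshold value of 0.8 for θ is an arbitrary choice (the paper leaves it unspecified):

```python
# Propagate a property from the atoms (leaves) up a taxonomy: a property is
# attached to an internal node when the share of its member atoms carrying
# the property reaches the membership threshold θ.

THETA = 0.8  # membership threshold θ; the concrete value is an assumption

def atoms_under(node, children):
    """Collect the leaf atoms below a taxonomy node (leaves are atoms)."""
    kids = children.get(node, [])
    if not kids:
        return [node]
    atoms = []
    for kid in kids:
        atoms.extend(atoms_under(kid, children))
    return atoms

def enrich(root, prop, atom_props, children):
    """Return the internal nodes at which `prop` can be attached."""
    enriched = set()

    def visit(node):
        kids = children.get(node, [])
        if not kids:  # atoms themselves are not enriched
            return
        atoms = atoms_under(node, children)
        share = sum(1 for a in atoms if prop in atom_props.get(a, set())) / len(atoms)
        if share >= THETA:
            enriched.add(node)
        for kid in kids:
            visit(kid)

    visit(root)
    return enriched
```

On a toy taxonomy where both drugs under one sub-class carry a contra-indication but a sibling drug does not, only that sub-class is enriched; its parent stays below θ and is left untouched. This also illustrates the caveat raised in Section 5.2: the smaller the group of atoms under a node, the more easily θ is reached.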

7 Related Work

The research work on constraint-based data cleaning is related to this work. A central aspect of this research was introduced in [3], where the notion of a repair is defined as the action of finding another database that is consistent and minimally differs from the original database. In [3] repairs are provided by means of insertions and deletions, while [23] proposes tuple updates as a repair primitive. Recently, [4] introduced the notion of conditional functional dependencies (CFDs), an extension of traditional functional dependencies that captures a fundamental part of the semantics of the data. The main objective of CFDs is to provide a new tool for data cleaning, and toward this goal [4] presents techniques and a sound and complete inference system for their consistency and implication analyses. SQL-based techniques for detecting inconsistencies as violations of CFDs have been tested, but the authors argue that much more work remains on several aspects, in particular on discovering CFDs. Like these works, our approach aims to store a new form of data dependencies to repair/clean existing databases, but it focuses on storing these dependencies in DLs, and much attention is given to discovering the dependencies using inductive reasoning on the data.

The inductive approach has been used in the domain of Knowledge Discovery in Databases (KDD), in particular to find classification rules. Classification systems create mappings from data to predefined classes, based on features of the data. Many of the solutions adopting this approach are based on the decision tree technology introduced in [18]. The obvious exploitation of this approach is to design a classification of concepts based on data induction. Another relation between ontologies and such approaches is to exploit ontologies as background knowledge to enhance data mining applications [21].
To our knowledge, this approach is a first step toward transforming existing classifications into expressive ontologies by inducing new concept properties.

8 Conclusion

We presented in this paper a simple yet effective and user-friendly solution to enhance the data quality of relational databases. In the process of repairing these databases, our solution makes it possible to construct domain ontologies from available terminologies and classifications. We demonstrated this solution on the medical domain with drug databases. We believe that this method can be generalized to other domains where terminologies are accessible. In future work, we aim to test our solution in the geographical domain with the Corine land cover hierarchy and subsets of the United Nations Standard Products and Services Code (UNSPSC) products and services classification. Another extension we are working on is related to mapping ontologies using the inductive approach described. Indeed, it is quite easy to map the EphMRA and ATC ontologies because their concepts are both defined using the same set of terms, e.g. the contra-indication, caution and side-effect terms. Finally, the extension that may have the biggest potential in the pharmaceutical domain is trying to automate the generation of SPCs from these and other ontologies. We think that integrating the Defined Daily Dose (DDD), which is generally associated with the ATC in order to define drug posology, would be a valuable step in this direction.

References

1. Abiteboul, S., Hull, R., Vianu, V.: Foundations of Databases. Addison Wesley, Reading (1995)
2. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook: Theory, Implementation and Applications. Cambridge University Press, Cambridge (2003)
3. Bertossi, L., Chomicki, J.: Query Answering in Inconsistent Databases. In: Chomicki, J., Saake, G., van der Meyden, R. (eds.) Logics for Emerging Applications of Databases. Springer, Heidelberg (2003)
4. Bohannon, P., Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional Functional Dependencies for Data Cleaning
5. Borgida, A.: On the Relative Expressiveness of Description Logics and Predicate Logics. Artificial Intelligence 82(1-2), 353–367 (1996)
6. Brachman, R.J.: What IS-A is and isn't: an analysis of taxonomic links in semantic networks. IEEE Computer 16, 30–36 (1983)
7. Cimino, J.J., Zhu, X.: The practical impact of ontologies on biomedical informatics. IMIA Yearbook of Medical Informatics, pp. 1–12 (2006)
8. Curé, O., Squelbut, R.: A database trigger strategy to maintain knowledge bases developed via data migration. In: Bento, C., Cardoso, A., Dias, G. (eds.) EPIA 2005. LNCS (LNAI), vol. 3808, pp. 206–217. Springer, Heidelberg (2005)
9. Curé, O.: Ontology Interaction with a Patient Electronic Health Record. In: Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems, pp. 185–190 (2005)
10. Curé, O., Squelbut, R.: Integrating data into an OWL Knowledge Base via the DBOM Protégé plug-in. In: Proceedings of the 9th International Protégé Conference (2006)


11. Dean, M., Schreiber, G.: OWL Web Ontology Language Reference. W3C Recommendation (2004)
12. de Bruijn, J., Lara, R., Polleres, A., Fensel, D.: OWL DL vs. OWL Flight: conceptual modeling and reasoning for the Semantic Web. In: Proceedings of the 14th International Conference on World Wide Web, pp. 623–632 (2005)
13. Gennari, J., Musen, M., Fergerson, R., Grosso, W., Crubezy, M., Eriksson, H., Noy, N., Tu, S.: The evolution of Protégé: an environment for knowledge-based systems development. International Journal of Human-Computer Studies 123, 58–89 (2003)
14. Hepp, M., de Bruijn, J.: GenTax: a generic methodology for deriving OWL and RDF-S ontologies from hierarchical classifications, thesauri, and inconsistent taxonomies. In: Proceedings of the European Semantic Web Conference (2007)
15. Horrocks, I., Sattler, U.: A Tableaux Decision Procedure for SHOIQ. In: Proc. of IJCAI 2005, pp. 448–453 (2005)
16. Kanellakis, P.C.: Elements of relational database theory. In: Handbook of Theoretical Computer Science (vol. B): Formal Models and Semantics, pp. 1073–1156. MIT Press, Cambridge (1990)
17. Motik, B., Horrocks, I., Sattler, U.: Bridging the gap between OWL and relational databases. In: Proceedings of the 16th International World Wide Web Conference (2007)
18. Quinlan, J.R.: Induction of Decision Trees. In: Readings in Machine Learning, pp. 81–106. Morgan Kaufmann, San Francisco (1990)
19. Reiter, R.: What Should a Database Know? Journal of Logic Programming 14(1-2), 127–153 (1992)
20. Shvaiko, P., Euzenat, J.: A Survey of Schema-Based Matching Approaches. Journal on Data Semantics IV, 146–171 (2005)
21. Taylor, M., Stoffel, K., Hendler, J.: Ontology-based Induction of High Level Classification Rules. In: Research Issues on Data Mining and Knowledge Discovery (DMKD) (1997)
22. WHO Collaborating Centre for Drug Statistics Methodology. http://www.whocc.no/atcddd/
23. Wijsen, J.: Condensed representation of database repairs for consistent query answering. In: Calvanese, D., Lenzerini, M., Motwani, R. (eds.) ICDT 2003. LNCS, vol. 2572, pp. 378–393. Springer, Heidelberg (2002)
