Università degli Studi di Roma “La Sapienza”
Dottorato di Ricerca in Ingegneria Informatica, XIX Ciclo – 2006

Université de Paris Sud
Doctorat de Recherche en Informatique
Structured and Semi-Structured Data Integration
Antonella Poggi
Thesis Committee:
Prof. Maurizio Lenzerini (Advisor, Italy)
Prof. Serge Abiteboul (Advisor, France)
Prof. Riccardo Rosati

Reviewers:
Prof. Bernd Amann
Prof. Alex Borgida
AUTHOR ’ S ADDRESS IN I TALY: Antonella Poggi Dipartimento di Informatica e Sistemistica Universit`a degli Studi di Roma “La Sapienza” Via Salaria 113, I-00198 Roma, Italy AUTHOR ’ S ADDRESS IN F RANCE : Antonella Poggi Departement d’Informatique Universit´e de Paris Sud Orsay Cedex, France E - MAIL : WWW:
[email protected] http://www.dis.uniroma1.it/∼poggi/
To Mario
Acknowledgements

Everything started one day in June 2000, when I decided to go on Erasmus to the Ecole Polytechnique in Paris and was advised to attend the database lectures given there by Prof. Abiteboul. When, in January 2001, I decided to follow Prof. Abiteboul's extra lectures in order to present a database project, he was so kind as to sit beside me and teach me how to write my first HTML page: my homepage. This was his way of introducing me to XML! A first internship at I.N.R.I.A. was my first great experience with research, and from that day on I never gave up “dreaming research”. When I came back home, I met Maurizio (whose lectures were the most exciting I ever had) and, thanks to him and Serge, I could participate in VLDB as a volunteer (Rome, Sept. 2001). How could anyone resist loving research after such a wonderful conference? Then I finished my exams and Maurizio supported me in going back to I.N.R.I.A. for a second internship (my final project, which led to my graduation thesis). On my return, I started collaborating with Maurizio, and he made me love database theory and data integration issues so much that I chose to start my PhD route...

Thanks to a European initiative, and thanks to both my advisors, who had to fight against Italian and French bureaucracy, I had the opportunity to carry out my research jointly between the Roman and the Parisian database groups. This was not always easy... But I was lucky to find such great researchers, both able to see such an amazing “big picture”! They were both mentors, fathers and friends. No word can express how much I would like to thank you both, Maurizio and Serge. I can only say once again: “Grazie - Merci” (as I am now used to concluding all my research talks). I will miss being such a favoured PhD student so much!

Of course, these acknowledgements cannot end without thanking my sweet husband Mario, and my family. Both have been so patient, so understanding... and, above all, they have always been “with” me. I love you, and always will.
Contents

I  Antechamber

1  Theoretical foundations of DIS
   1.1  Logical framework
   1.2  Consistency of a DIS
   1.3  Query answering over DIS
   1.4  Updates over DIS
   1.5  Relationship with databases with incomplete information

2  State of the art of DIS
   2.1  Commercial “data integration” tools
   2.2  Global picture of the state of the art
   2.3  Main related DIS
        2.3.1  LAV approach
        2.3.2  GAV approach
        2.3.3  GLAV approach

II  Ontology-based DIS

3  The language
   3.1  DL-Lite_FRS
        3.1.1  DL-Lite_FRS expressions
        3.1.2  DL-Lite_FRS TBox
        3.1.3  DL-Lite_FRS ABox
        3.1.4  DL-Lite_FRS knowledge base
   3.2  Query language
   3.3  DL-Lite_A

4  DL-Lite_A reasoning
   4.1  Storage of a DL-Lite_A ABox
   4.2  Preliminaries
        4.2.1  Minimal model for a DL-Lite_A ABox
        4.2.2  Canonical interpretation
        4.2.3  Closure of negative inclusions
   4.3  Satisfiability of a DL-Lite_A KB
        4.3.1  Foundations of the algorithm for satisfiability
        4.3.2  Satisfiability algorithm
   4.4  Query answering over DL-Lite_A KB
        4.4.1  Foundations of query answering algorithm
        4.4.2  Query answering algorithm

5  Consistency and Query Answering over Ontology-based DIS
   5.1  DL-Lite_A ontology-based DIS
        5.1.1  Linking data to DL-Lite_A objects
        5.1.2  Logical framework for DL-Lite_A DIS
   5.2  Overview of consistency and query answering method
        5.2.1  The notion of virtual ABox
        5.2.2  A “naive” bottom-up approach
        5.2.3  A top-down approach
        5.2.4  Relevant notions from logic programming
   5.3  DL-Lite_A DIS consistency and query answering
        5.3.1  Modularizability
        5.3.2  Consistency algorithm
        5.3.3  Query answering algorithm
        5.3.4  Computational complexity

6  Updates of Ontologies at the Instance Level
   6.1  The DL-Lite_FS language
   6.2  Instance-level ontology update
   6.3  Computing updates in DL-Lite_FS ontologies

III  XML-based DIS

7  The setting
   7.1  Data model
   7.2  Tree Type
   7.3  Constraints and schema language
   7.4  Prefix Queries

8  XML-based DIS
   8.1  XML DIS logical framework
   8.2  Identification
   8.3  XML DIS consistency

9  XML DIS query answering
   9.1  Lower-bound for query answering under exact mappings
   9.2  Incomplete trees
   9.3  Query answering using incomplete trees
   9.4  Query answering algorithms
        9.4.1  Algorithm under VKR and no key constraints
        9.4.2  Algorithm under IdG sound and complete

Conclusion

Bibliography
Part I
Antechamber
Data integration is a huge area of research that is concerned with the problem of combining data residing at heterogeneous, autonomous and distributed data sources, and providing the user with a unified virtual view of all this data. Today's fast and continuous growth of large business organizations, often resulting from mergers of smaller enterprises, creates an increasing need for integrating and sharing large amounts of data coming from a number of heterogeneous and distributed data sources. Such needs also arise in other applications, like information systems for administrative organizations, life sciences research and many others. Moreover, it is not infrequent that different parts of the same organization adopt different systems to produce and maintain critical data. Clearly, data integration is a challenge in all these kinds of situations. Furthermore, it has become even more attractive thanks to the ubiquitous spread of the World Wide Web and the access to information it provides.

Hence, during the last decade, research and business interest has migrated from DataBase Management Systems, DBMS (Codd, 70's [37]), to Data Integration Systems (DIS). Whereas the former make a unique local data source accessible through a schema, the latter offer the necessary framework to combine the data from a set of heterogeneous and autonomous sources through a so-called global schema (or mediated schema). Thus, the global schema does not contain data by itself, but provides a reconciled, integrated and virtual view of the underlying sources, which in contrast contain the actual data. We stress that, since the global schema acts as the interface through which the user accesses the data, the choice of the language for expressing and querying such a schema is crucial. In particular, whereas research on the topic has already produced several DIS, rather few of them represent an appropriate trade-off between the expressive power of the languages for specifying the global schema and querying the system, and the efficiency of query answering. Nevertheless, both these aspects deserve to be considered simultaneously. Indeed, the issue of providing a rich set of semantic constraints over the global schema becomes more and more crucial as soon as one wants to use even basic conceptual modeling constructs for an application. On the other hand, offering an expressive query language and allowing for efficient query answering over typically large amounts of data are obvious requirements for this kind of system.

In this thesis, we focus on the study of hierarchical DIS, where the global schema acts as a client of the data sources, as opposed to Peer-to-Peer DIS, where the global schema acts both as a client and as a server for other DIS. In particular, motivated by the challenges discussed above, we investigate both structured and semi-structured data integration, in the two major contexts of ontology-based data integration and XML-based data integration. On the one hand, ontology-based DIS are characterized by a
global schema described at the intensional level of an ontology, i.e., the shared conceptualization of a domain of interest. The main issue here is that reasoning over typical ontology languages is extremely costly with respect to the size of the data. Notably, we propose a setting where answering queries over the ontology-based DIS is LOGSPACE in data complexity. On the other hand, XML-based DIS are characterized by an expressive global schema. This is a novel setting, not much investigated yet. The main issue here concerns the presence of a significant set of integrity constraints expressed over the schema, and the concept of node identity, which requires particular attention when data come from autonomous data sources. In both contexts, our contribution consists in formally approaching the following issues.

❒ The modeling issue, which requires providing the user with everything needed for modeling the DIS. More precisely, the user is given (i) a language for specifying the global schema, (ii) a language for specifying the set of source schemas, and (iii) a formalism to specify the relationship existing between the data at the sources and the elements of the global schema.

❒ The query answering issue, which is concerned with the basic service offered by a DIS, namely the ability of answering queries posed over the DIS global schema. We provide an appropriate query language and algorithms to answer queries posed to the DIS. We also study the complexity of the problem in both contexts, under a variety of assumptions on the DIS specification.

❒ Since sources are in general autonomous, we also investigate the problem of detecting inconsistencies among data sources, a problem which is most of the time ignored in DIS research, thus resulting in a quite unrealistic setting.

❒ Finally, we begin the investigation of updates over DIS, in the context of ontology-based DIS. This concerns the problem of accepting updates expressed in terms of the global schema, aiming at reflecting them by changes at the source data level. This is the first investigation we are aware of that goes in this challenging direction.

Our research has been carried out under the joint supervision of the Department of Computer Science of the University of Rome “La Sapienza” and the GEMO INRIA-Futurs project, resulting from the merger of the INRIA-Rocquencourt Verso project and the IASI group of the University of Paris-Sud.

The thesis is organized as follows. The first part of the thesis serves as an introduction to the theoretical foundations of our approach to DIS, and as a motivation for it. The second part of the thesis is devoted to the examination of ontology-based DIS, while the third part is concerned with XML-based DIS.
Chapter 1
Theoretical foundations of DIS

In this chapter, we introduce the main theoretical foundations underlying our investigation of DIS [63]. Specifically, we start by setting up a logical framework for data integration. Then we present the main issues related to DIS that will be the focus of our attention, namely consistency checking and query answering. Afterwards, we introduce the problem of performing updates over DIS. Finally, we discuss the relationship existing between DIS and databases with incomplete information [58].
1.1 Logical framework

As already mentioned, in this work we are interested in studying DIS, whose aim is combining data residing at different sources and providing the user with a unified view of these data. Such a unified view is represented by the global schema. Thus, one of the most important aspects in the design of a DIS is the specification of the correspondence between the data at the sources and the elements of the global schema. Such a correspondence is modeled through the notion of mapping. It follows that the main components of a data integration system are the global schema, the sources, and the mapping. Thus, we formalize a data integration system Π in terms of a triple ⟨G, S, M⟩, where

❒ G is the global schema, expressed in a language L_G over an alphabet A_G. The alphabet comprises a symbol for each element of G (i.e., a relation if G is relational, a concept or a role if G is a Description Logic, a label if G is an XML DTD, etc.).

❒ S is the source schema, expressed in a language L_S over an alphabet A_S. The alphabet A_S includes a symbol for each element of the sources.

❒ M is the mapping between G and S, consisting of a set of assertions M_i, each having the form (q_S, q_G, as) or (q_G, q_S, as), where q_S and q_G are two queries of the same arity, respectively over the source schema S and over the global schema G, and as may assume the value sound,
complete or exact. Queries q_S are expressed in a query language L_{M,S} over the alphabet A_S, and queries q_G are expressed in a query language L_{M,G} over the alphabet A_G. The value as models the accuracy of the mapping.

Note that the definition above has been taken from [63], and it is general enough to capture all approaches in the literature, including in particular the DIS considered in this thesis.

We call database a set of collections of data. We say that a source database (also referred to as a set of data sources) D = {D_1, ..., D_m} conforms to a schema S = {S_1, ..., S_m} if D_i is an instance of S_i for i = 1, ..., m (where clearly the notion of D_i being an instance of S_i depends on the language L_S for expressing S). Moreover, we call global database an instance of the global schema G over a domain Γ.^1 Thus, given a set of sources D conforming to S, we call the set of legal databases for Π w.r.t. D, denoted sem(Π, D), the set of databases B such that:

❒ B is a global database, and

❒ B satisfies the mapping M w.r.t. D.

Clearly, the notion of B satisfying M w.r.t. D depends on the semantics of the mapping assertions. Intuitively, the assertion (q_S, q_G, as) means that the concept represented by the query q_S over the sources D corresponds to the concept in the global schema represented by the query q_G, with the accuracy specified by as. Formally, let q be a query of arity n and DB a database. We denote by q^DB the set of n-tuples in DB that satisfy q. Then, given a set of data sources D conforming to S and a global database B, we say that B satisfies M w.r.t. D if, for each M_i in M of the form (q_S, q_G, as), we have that:

❒ if as = sound, then q_G^B ⊇ q_S^D;

❒ if as = complete, then q_G^B ⊆ q_S^D;

❒ if as = exact, then q_G^B = q_S^D.

Typically, sources in DIS are considered sound. This will also be the assumption we make in the investigation of ontology-based DIS. In contrast, in the XML-based context, we will also study the case of exact mappings, which appear to be useful when one considers a data source as an “authority” providing exactly all the information about a certain topic. On the other hand, we do not consider the case of complete mappings, since it appears less interesting in practice.

Note that the different forms of mappings have led to the following characterization of the approaches to data integration in the literature [53]:

❒ In the Local-As-View (LAV) approach, mappings in M have the form (s, q_G, as), where s is an element of the source schema.
^1 In particular, in this thesis, we consider the case of a global database being a first-order logic model (Δ^I, ·^I) of G, if G is the intensional level of a Description Logic (DL) [21] ontology, or an XML document satisfying G, if G is a DTD provided with a set of integrity constraints.
❒ In the Global-As-View (GAV) approach, they have the form (q_S, g, as), where g is an element of the global schema.

❒ In the Global-and-Local-As-View (GLAV) approach, no particular assumption is made on the form of mappings.

Clearly, the LAV approach favors the extensibility of the system, since adding a new source simply requires enriching the mapping with a new assertion, without other changes. On the other hand, the GAV approach has a more procedural flavor, since it tells the system how to use the sources to retrieve the data.

Before concluding this presentation of the logical framework for data integration, we observe that, no matter what the interpretation of the mapping is, in general several global databases exist that are legal for Π with respect to D. This observation motivates the relationship between data integration and databases with incomplete information [86], which will be discussed in Section 1.5.
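To fix intuition, here is a small hedged illustration of the two mapping styles and of the sound semantics; the relation names are invented for this example and are not taken from the thesis. Suppose the global schema G contains a binary relation professor(name, dept), and the source schema S contains relations s1(name, dept) and s2(name, dept, office).

A LAV assertion describes a source element as a view over G:

    ( s1(n, d),   q_G(n, d) ← professor(n, d),   sound )

A GAV assertion defines a global element as a view over the sources:

    ( q_S(n, d) ← s1(n, d) ∨ ∃o. s2(n, d, o),   professor(n, d),   sound )

Under the sound semantics, every legal global database B must satisfy q_G^B ⊇ q_S^D for each assertion: B contains at least the tuples retrieved from the sources, but possibly more.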
1.2 Consistency of a DIS

Given a data integration system Π = ⟨G, S, M⟩ and a set of sources D conforming to S, it may happen that no legal database exists satisfying both the global schema constraints and the mappings w.r.t. D, i.e., sem(Π, D) = ∅. In this case, we say that the system is inconsistent w.r.t. D. It is worth noting that this kind of situation is particularly critical, since, as we will see, it makes query answering meaningless. Despite its importance, this situation is often glossed over in data integration systems, or dealt with by means of a-priori and ad-hoc transformation and cleaning procedures applied to the data retrieved from the sources (e.g., [44]). Here we address the problem from a more theoretical perspective. In particular, we believe that the first step in dealing with inconsistencies is obviously to detect whether they occur. Thus, we study the problem of deciding whether a system is consistent w.r.t. a set of data sources. Such a problem can be formulated as follows:

PROBLEM: DIS CONSISTENCY
INPUT: A data integration system Π = ⟨G, S, M⟩, a set of data sources D conforming to S
QUESTION: Is there a database B legal for Π w.r.t. D?
In both ontology-based and XML-based DIS, we will study DIS consistency, show that it is decidable, examine its complexity and provide algorithms to solve it. However, we do not consider in this thesis the problem of reconciling the data at the sources, i.e., modifying the data retrieved from the sources so that the system becomes consistent. This is a challenging issue that we intend to address in the future.
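As a small hedged illustration of inconsistency (again with invented names), suppose the global schema G contains a relation person(ssn, name) on which ssn is declared to be a key, and that two sources are soundly mapped into person, delivering the tuples (1, ann) and (1, bob), respectively. Every legal database B must then satisfy

    person^B ⊇ {(1, ann)}   and   person^B ⊇ {(1, bob)},

which violates the key constraint on ssn. Hence no legal database exists, sem(Π, D) = ∅, and the system is inconsistent w.r.t. D.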
1.3 Query answering over DIS

The basic service offered by a DIS is query answering, i.e., the ability of answering queries that are posed in terms of the global schema G and are expressed in a language L_q over the alphabet A_G. Given a DIS Π = ⟨G, S, M⟩ and a set of data sources D conforming to S, the certain answers q(Π, D) to a query q posed over Π w.r.t. D are the set of tuples t of elements in Γ (i.e., the domain of the instances of G) such that t ∈ q^B for every legal database B for Π w.r.t. D, or equivalently:

    q(Π, D) = { t | t ∈ q^B, ∀B ∈ sem(Π, D) }

Query answering can be tackled under two different forms. Under the so-called recognition form, it is formulated as follows:

PROBLEM: QUERY ANSWERING (RECOGNITION)
INPUT: Consistent data integration system Π = ⟨G, S, M⟩, set of data sources D conforming to S, query q, and tuple t of elements of Γ
QUESTION: Is t in q(Π, D)?
Sometimes query answering assumes a more ambitious form and aims at finding the entire set of certain answers. In that case it is formulated as follows:

PROBLEM: QUERY ANSWERING (FULL SET)
INPUT: Consistent data integration system Π = ⟨G, S, M⟩, set of data sources D conforming to S, query q
QUESTION: Find all t such that t ∈ q(Π, D).
As for DIS consistency, in our investigation we will study DIS query answering under different assumptions, show that it is decidable, examine its complexity and provide algorithms to solve it. Note in particular that both formulations of the query answering problem assume a consistent DIS. Indeed, in this thesis we are not concerned with the problem of answering queries in the presence of mutually inconsistent data sources. One possibility for addressing such a problem is to follow an approach in the spirit of [62], where the authors advocate the use of an “approximate” semantics for mappings.
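A small hedged example of certain answers (with invented names): suppose the only mapping assertion is a sound GAV assertion mapping a unary source relation s1 to a global relation professor, and that s1^D = {ann}. Every legal database B satisfies professor^B ⊇ {ann}, but may contain further tuples (e.g., professor^B = {ann, bob} is also legal). For the query q(x) ← professor(x) we therefore have

    q(Π, D) = ⋂_{B ∈ sem(Π, D)} q^B = {ann},

since ann belongs to every legal database, whereas bob belongs to some but not all of them.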
1.4 Updates over DIS

In this section, we introduce write-also DIS, i.e., DIS that allow for performing updates expressed over the global schema. Several approaches to update have been proposed in the literature; see, e.g., [39] for a survey. In particular, different change
operators are appropriate depending on whether the change is a revision [20], i.e., a correction to the actual state of beliefs, or an update [88], reflecting a change in the world. In this section, even though we use the term “update”, we do not aim at advocating the use of one particular approach. On the contrary, we assume to have an arbitrary operator ◦. Moreover, we assume to have an update F expressed as a formula in terms of G, which intuitively is sanctioned to be true in the new state, i.e., it is “inserted” in the updated DIS specification. Thus, given a DIS Π = ⟨G, S, M⟩, a set of data sources D conforming to S, and the update F, once ◦ is applied with F to the set of legal databases for Π w.r.t. D, we obtain a new set of databases, however characterized, reflecting the change F.

Note that we are interested in instance-level updates. This means that we assume that the specification of Π is invariant, whereas the update reflects a change that occurs at the sources D. Thus, in particular, we consider an update of Π with a set F of facts having the form g(t), where t is an n-tuple of elements of Γ and g is an element of G, meaning that the change consists in t being an instance of g. Thus, we formulate the problem of updating a DIS as follows:

PROBLEM: EXPRESSIBLE UPDATE
INPUT: Consistent data integration system Π = ⟨G, S, M⟩, set of data sources D conforming to S, set of facts F
QUESTION: Is there D′ such that sem(Π, D′) = sem(Π, D) ◦ F?
The above formulation is general enough to capture all approaches to update that have been proposed in the literature. However, it raises at least the following considerations.

❒ Typically, the user of a DIS is not the owner of the data sources and thus does not have the right to modify their content. This is probably the reason why, as far as we know, DIS update has not been considered as an issue yet. However, we believe that a DIS should be able to provide the appropriate infrastructure to allow the user to perform an instance-level update without changing the data at the sources. This could be achieved, for instance, by using internal “proprietary” sources.

❒ What if no set of data sources exists that solves the update problem formulated above (not even with “proprietary” sources)? As usual, one possibility would be to “relax” the semantics of the update. Indeed, we might be interested in reasoning, e.g., answering queries, over the DIS resulting from the update. To do so, we do not necessarily need to materialize a new set of data sources; we could instead reason on the original DIS by taking the update into account in a “virtual” way. In a sense, this is analogous to the distinction between projection via regression vs. progression in reasoning about actions [83].

Both the considerations above have motivated the beginning of our work on DIS update. So far, we have started tackling the problem for ontology-based DIS (cf. Chapter 6).
1.5 Relationship with databases with incomplete information

Before concluding this introductory chapter on the theoretical foundations of our approach to data integration, we briefly discuss the strong connection existing between DIS and databases with incomplete information. Specifically, a database with incomplete information can be viewed as a set of possible states of the real world. Similarly, given a set of data sources, a DIS represents a set of possible databases. Thus, when a query is posed over a database with incomplete information or a DIS, the problem arises of evaluating the query over a possibly infinite set of database states. It follows that, in order to solve query answering over a DIS, one possibility is to find a finite representation of the set of possible databases and to provide algorithms to answer queries over such a representation. Indeed, this is the main idea underlying both the works presented in this thesis. Note, in particular, that this approach recalls the one proposed in a landmark paper by Imielinski and Lipski [58], which consists in answering queries over a database with incomplete information by exploiting the notion of representation system. Moreover, interestingly, in [4] the same approach is extended to deal with updates over databases with incomplete information.
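As a hedged illustration of the representation-system idea (using the standard notion of a table with an unknown value, not an example from the thesis), let T = {person(ann), person(x)}, where x is an unknown constant. T finitely represents the infinite set of complete databases obtained by substituting any constant for x. A Boolean query is certainly true if and only if it holds in every represented database: ∃y. person(y) is certainly true, whereas person(bob) is not, since it holds only in those databases where x = bob.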
Chapter 2
State of the art of DIS

As already discussed, data integration has emerged as a pervasive challenge in the last decade. Such a success recalls the crucial impact of DBMS, proven by the large number of DBMS scattered all around the world. However, while the success of relational DBMS represents a great exception to the usual bottom-up process of emerging technologies, since it was preceded by a deep understanding and a wide acceptance of the relational model and the related theory, the interest in data integration systems grew contemporaneously in both the business and the research community. In particular, it led to the implementation of systems without yet a deep understanding of all the intricate issues involved, concerning design-time as well as run-time aspects [54].

Clearly, it would be unrealistic to aim at being comprehensive while discussing the state of the art of such a huge field. Thus, in this chapter, we start by briefly discussing the commercial solutions to the need for integrating data. Afterwards, we contextualize our contribution within the global picture of the state of the art in the data integration research field. Finally, according to such a global picture, we discuss in more detail the works that are most closely related to our investigation.
2.1 Commercial “data integration” tools

Recently, some software solutions to the need for integrating data have emerged, suggesting the adoption of a DBMS as a kind of middleware infrastructure that uses a set of software modules, called wrappers, to access heterogeneous data sources [51]. Wrappers hide the native characteristics of each source, masking them under the appearance of a common relational table. Furthermore, their aim is to mediate between the federated database and the sources, mapping the data model of each source to the federated database data model, and transforming operations over the federated database into requests that the source can handle. Examples of commercial products following this kind of approach are Oracle Integration [75] and DB2 Information Integrator (DB2II) [74]. Obviously, they are based on the use of the Oracle and IBM DBMS, respectively.

Even though remarkable from the point of view of the number of different types of data sources supported, as well as from the point of view of query optimization,
these products are essentially data federation tools that are still far from data integration system theory as it is by now well established in the scientific database community. Indeed, as we argued in [81], they actually allow the user to combine data coming from heterogeneous and autonomous sources, but do not provide the user with a unified view that is (logically) independent of the sources. It is worth noticing, however, that data federation tools can be used as the essential underlying environment on top of which one can build a DIS. In particular, we show in [81] how to implement a DIS based on a relational schema by means of a commercial tool for data federation. In a nutshell, this is obtained by: (i) producing an instance of a federated database through the compilation of a formal DIS specification as formalized in the previous chapter; (ii) translating the user queries posed over the global schema, so as to issue them to the federated database. Even though interesting in order to highlight the mismatch between commercial products and research prototypes currently available, this approach is clearly far from solving the main challenge addressed in this thesis, since it allows only a limited expressive power for the global schema (without constraints) and requires following a GAV approach.
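Step (ii) above essentially amounts to unfolding: each atom of a user query over the global schema is replaced by the source query that the GAV mapping associates with it. The following is a minimal sketch of this idea in Python; all names and data structures are invented for illustration and do not reflect the implementation described in [81].

    # Hypothetical sketch: unfold a conjunctive query over the global schema
    # into a query over the (federated) source relations, given GAV mappings.
    # A query is a list of atoms; an atom is a (relation, arguments) pair.

    gav_mappings = {
        # professor(n, d) is defined as the source query s1(n, d)
        "professor": (["n", "d"], [("s1", ["n", "d"])]),
    }

    def unfold(query_atoms, mappings):
        """Replace every global-schema atom by its GAV view definition."""
        unfolded, fresh = [], 0
        for rel, args in query_atoms:
            head_vars, body = mappings[rel]
            # Head variables of the mapping are bound to the query's arguments;
            # the remaining variables get fresh names to avoid clashes.
            renaming = dict(zip(head_vars, args))
            for src_rel, src_args in body:
                new_args = []
                for v in src_args:
                    if v not in renaming:
                        renaming[v] = f"_v{fresh}"
                        fresh += 1
                    new_args.append(renaming[v])
                unfolded.append((src_rel, new_args))
        return unfolded

    # q(x) :- professor(x, "CS")  becomes  q(x) :- s1(x, "CS")
    print(unfold([("professor", ["x", '"CS"'])], gav_mappings))

The unfolded query can then be issued directly to the federated database, which dispatches the source atoms to the underlying wrappers.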
2.2 Global picture of the state of the art

In this section, we aim at giving a global picture of the state of the art in data integration and at contextualizing our contribution with respect to this global picture. From the previous chapter, it follows that a DIS specification depends on the following aspects:

❒ the data model chosen for the global database;

❒ the language used to express the global schema, i.e., the set of constraints characterizing it;

❒ the approach followed to specify the mapping, i.e., GAV, LAV or GLAV;

❒ the accuracy of the mappings (or, equivalently, of the data sources), i.e., sound or exact (as already argued, complete mappings are less interesting in practice).

Another aspect that deserves to be considered when classifying DIS is the architectural paradigm used. As already mentioned, in this thesis we focus on hierarchical DIS, where it is possible to clearly distinguish between two different roles played, on one hand, by the global schema, which is accessed by the user and which does not contain data by itself, and, on the other hand, by the underlying sources, which contain the actual data. Another paradigm has recently emerged for DIS, as well as for other distributed systems, namely the Peer-To-Peer (P2P) paradigm. Put in an abstract way, P2P DIS are characterized by an architecture consisting of various autonomous nodes (called peers) which hold information, and which are linked to other nodes by means of mappings. Each node therefore provides part of the overall information available from a distributed environment and acts both as a client and as a server in the system, without relying on a single global view. However, in some sense, P2P data integration
systems can be considered as the natural extension of hierarchical data integration systems, since each node of the system may by itself be considered as an extended hierarchical DIS, which includes, besides the mapping to local data sources, an external mapping to the schemas of other nodes.^1 Note that, since research in P2P data integration is still quite young, no commercial product has really emerged yet.

Fig. 2.1 summarizes the state of the art in data integration. More precisely, it classifies the main integration systems according to the features discussed above. Thus, it highlights the systems that are closest to our investigation and can therefore be compared with our study. In the next two sections we describe some of these systems, focusing on those whose global schema is specified by means of (i) a Description Logic (and which can thus be considered as DIS based on the relational model, characterized by a significant set of semantic constraints), and (ii) XML^2 (and thus a semi-structured data model).

It is worth noting that, in Fig. 2.1, we do not consider Data Warehousing systems nor Data Exchange systems, which, even though related to DIS, are based on a different form of data interoperability. Indeed, their aim is to export a materialized instance of the global schema, whereas DIS are characterized by a global schema that is virtual. In particular, data exchange is the problem of moving and restructuring data from a (generally unique) data source to the global schema (called target schema), given the specification of the mapping (called source-to-target dependencies) between the source and the target. Data exchange has recently become an active research topic due to the increased need for exchanging data in various formats, typically in e-business applications [9]. The papers [41, 40] laid the theoretical foundations of the exchange of relational data, and several follow-up papers studied various issues in data exchange, such as schema mapping composition [11].
2.3 Main related DIS In order to present main DIS that are closest to the work studied in this thesis, we next discuss those systems that are most comparable to our investigation, because e.g. of the expressivity of the global schema (cf. Fig. 2.1). In particular, we classify such systems on the basis of the approach followed for mappings specification. Note that despite the great increasing interest in XML from both business and research, little previous work has addressed XML-based data integration issue, as defined and studied here. In contrast, considerable work has addressed XML publishing systems and some initial work has focused on basic theoretical XML data exchange issues. Both these kinds of work are somehow orthogonal to our investigation since, besides assuming to materialize the global schema, they consider a unique data source. Hence, they were not presented in Fig. 2.1. However, in the XML setting, where not much work has addressed even basic data integration issues, they appear as relevant. Thus, we will present some of them. 1 Clearly, this is only an abstraction since the possible presence of cycles among peers complicates notably P2P DIS and introduces new challenging issues (see e.g. [28]). 2 Reader is assumed to be familiar with notation and terminology of the relational model [5], XML [2] and DLs [14].
Table 2.1: DIS state of the art. The table classifies the main data integration systems (Information Manifold [60], PICSEL [48], IBIS [24], DIS@DIS [27], INFOMIX [64], TSIMMIS [45], [34], STYX [8], Agora [73], [90], [32], Piazza [55], ActiveXML [1]) according to their paradigm (Hierarchical or P2P), data model (Relational, Semi-structured, Object-oriented, XML), constraints on the global schema (e.g., inclusion dependencies, functional dependencies, keys, foreign keys, DTDs, XML Schema types), mapping approach (LAV, GAV, GLAV), and mapping accuracy (sound, exact).
2.3.1 LAV approach
Information Manifold

Information Manifold (IM) [67] is a DIS developed at AT&T, based on the CARIN Description Logic [66]. CARIN combines a Description Logic allowing for expressing disjunction of concepts and role number restrictions with function-free Horn rules. Thus, IM handles the presence of inclusion dependencies over the global schema, and uses conjunctive queries as the language for querying the system and for specifying sound LAV mappings. The main distinguishing feature of IM is the use of the bucket algorithm for query answering. In order to illustrate it, we first recall that in LAV the mappings between the sources and the global schema are described as a set of views over the global schema. Thus, query processing amounts to finding a way to answer a query posed over a database schema using a set of views over the same schema. This problem, called answering queries using views, is widely studied in the literature, since it has applications in many areas (see e.g. [53] for a survey). The most common approach proposed to deal with answering queries using views is query rewriting. In query rewriting, a query and a set of view definitions over a database schema are provided, and the goal is to reformulate the query into an expression, the rewriting, whose evaluation over the view extensions supplies the answer to the query. Thus, query answering via query rewriting is divided into two steps: the first one reformulates the query, in the given query language, over the alphabet of the views (possibly augmented with auxiliary predicates), and the second one evaluates the rewriting over the view extensions. Clearly, the set of available sources may in general not store all the data needed to answer a user query, and therefore the goal is to find a rewriting that provides the maximal set of answers that can be obtained from the views. The bucket algorithm, presented in [65], is actually a query rewriting algorithm that is proved to be sound and complete with respect to the problem of answering user queries (under a first-order logic formalization of the system) only in the absence of integrity constraints on the global schema; it is in general not complete when integrity constraints are expressed on it.

StyX

According to Fig. 2.1, StyX [8] is based on the use of an object-oriented global schema describing the intensional level of an ontology as a labeled graph, whose nodes represent concepts and whose edge labels represent either roles (i.e., relationships) between concepts or inclusion assertions. As for constraints, StyX allows the specification of a set of keys over the global schema. On the other hand, StyX allows the integration of XML data sources. These are described in terms of path-to-path mapping rules that associate paths in the XML sources with paths in the global schema. Thus, StyX follows the LAV approach, and it addresses the problem of query rewriting in the presence of sound LAV mappings. StyX suggests an appealing way of merging the two parts of this thesis. However, this would require, first, an analysis of the properties of the StyX query answering algorithm (e.g., completeness), and, second, a deep understanding of the impact of introducing in
the StyX global schema a set of constraints comparable to ours. This is all the more an issue given that StyX is not concerned with the detection of inconsistencies among data sources.

Agora

Agora [73] is an XML-based DIS whose global schema is specified by means of an XML DTD (without any additional integrity constraints). Moreover, Agora is characterized by a set of sound mappings that follow the LAV approach. More precisely, mappings are defined in terms of an intermediate virtual, generic relational schema that closely models the generic structure of the XML global schema, rather than in terms of the XML global schema itself. Thus, the Agora query processing technique is based on query rewriting, which is performed by first translating to the generic relational schema and then employing traditional relational techniques for answering queries using views. Note that, because of the translation, queries and mappings can be quite complex and hard for a human user to understand and define.
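To illustrate the answering-queries-using-views step underlying the LAV systems above, consider the following hedged, textbook-style example (relation and view names are invented and not drawn from any of the cited systems). Let the global schema contain teaches(prof, course) and enrolled(student, course), and let the available sources be described by the sound views

    v1(p, c) ← teaches(p, c)        v2(s, c) ← enrolled(s, c).

For the user query

    q(p, s) ← teaches(p, c) ∧ enrolled(s, c),

a rewriting in terms of the views is

    q'(p, s) ← v1(p, c) ∧ v2(s, c),

and evaluating q' over the view extensions yields answers to q that are certain, since each view extension is contained in the corresponding global relation.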
2.3.2 GAV approach
The TSIMMIS Project

TSIMMIS (The Stanford-IBM Manager of Multiple Information Sources) is a joint project of Stanford University and the IBM Almaden database research group [36]. It is based on an architecture that presents a hierarchy of wrappers and mediators, in which wrappers convert data from each source into a common data model called OEM (Object Exchange Model), and mediators combine and integrate data exported by wrappers or by other mediators. Hence, the global schema is essentially constituted by the set of OEM objects exported by wrappers and mediators. Mediators are defined in terms of a logical language called MSL (Mediator Specification Language), which is essentially Datalog extended to support OEM objects. OEM is a semi-structured and self-describing data model, in which each object has an associated label, a type for the value of the object, and a value (or a set of values). User queries are posed in terms of objects synthesized at a mediator or directly exported by a wrapper. They are expressed in MSL or in a specific query language called LOREL (Lightweight Object REpository Language), an object-oriented extension of SQL. Each query is processed by a module, the Mediator Specification Interpreter (MSI) [79, 89], consisting of three main components:

❒ The View Expander, which uses the mediator specification to reformulate the query into a logical plan, by expanding the objects exported by the mediator according to their definitions. The logical plan is a set of MSL rules which refer to information at the sources.

❒ The Plan Generator, also called Cost-Based Optimizer, which develops a physical plan specifying which queries will be sent to the sources, the order in which they will be processed, and how the results of the queries will be combined in order to derive the answer to the original query.
❒ The Execution engine, which executes the physical plan and produces the answer.

The problem of query processing in TSIMMIS in the presence of limitations in accessing the sources is addressed in [68] by devising a more complex Plan Generator comprising three modules:

❒ a matcher, which retrieves queries that can process part of the logical plan;

❒ a sequencer, which pieces together the selected source queries in order to construct feasible plans;

❒ an optimizer, which selects the most efficient feasible plan.

It has to be stressed that in TSIMMIS no global integration is ever performed. Each mediator performs integration independently. As a result, for example, a certain concept may be seen in completely different and even inconsistent ways by different mediators. This form of integration can be called query-based, since each mediator supports a certain set of queries, i.e., those related to the view it provides.

The IBIS system

The Internet-Based Information System (IBIS) [25] is a tool for the semantic integration of heterogeneous data sources, developed in the context of a collaboration between the University of Rome “La Sapienza” and CM Sistemi. IBIS adopts innovative solutions to deal with all aspects of a complex data integration environment, including source wrapping, limitations on source access, and query answering under integrity constraints. IBIS uses a relational global schema to query the data at the sources, and is able to cope with a variety of heterogeneous data sources, including data sources on the Web, relational databases, and legacy sources. Each non-relational source is wrapped to provide a relational view on it. Also, IBIS mappings follow the GAV approach and each source is considered sound. The system allows for the specification of integrity constraints on the global schema; in addition, IBIS considers the presence of some forms of constraints on the source schema, in order to perform runtime optimization during data extraction. In particular, key and foreign key constraints can be specified on the global schema, while functional dependencies and full-width inclusion dependencies, i.e., inclusions between entire relations, can be specified on the source schema. Query processing in IBIS is separated into three phases:

1. the query is expanded to take into account the integrity constraints in the global schema (a hedged illustration of this expansion step is given right after this list);

2. the atoms in the expanded query are unfolded according to their definition in terms of the mapping, obtaining a query expressed over the sources;

3. the expanded and unfolded query is executed over the retrieved source databases, whose data are extracted by the Extractor module, which retrieves from the sources all the tuples that may be used to answer the original query.
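As a minimal hedged illustration of the expansion step (item 1), with invented names and not taken from [23]: suppose the global schema contains unary relations student and person, with the inclusion dependency student ⊆ person, both soundly mapped in GAV style. The user query

    q(x) ← person(x)

is expanded into the union

    q_exp(x) ← person(x)        q_exp(x) ← student(x),

since every retrieved student is certainly also a person. Each atom of q_exp is then unfolded according to its mapping, and the resulting query is evaluated over the data retrieved from the sources.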
Query unfolding and execution are the standard steps of query processing in GAV data integration systems, while for the expansion phase IBIS makes use of the algorithm presented in [23].

INFOMIX and DIS@DIS

INFOMIX [64] is a semantic integration system that provides solutions for GAV data integration of heterogeneous data sources (e.g., relational, XML, HTML) accessed through relational global schemas over which powerful forms of integrity constraints can be specified (e.g., keys, inclusion and exclusion dependencies), and user queries are specified in a powerful query language (e.g., Datalog). The query answering technique proposed in this system is based on query rewriting in Datalog enriched with negation and disjunction, under stable model semantics [26, 49]. A setting similar to the one considered in INFOMIX is at the basis of the DIS@DIS system [27]. Even if limited in its capability of integrating sources with different data formats (the system actually considers only relational data sources), DIS@DIS also provides mechanisms for the integration of inconsistent data in LAV. Furthermore, w.r.t. the query language considered, INFOMIX and DIS@DIS aim at supporting more general, highly expressive classes of queries (including queries intractable under worst-case complexity).

PICSEL

Similarly to IM, PICSEL is based on CARIN and the use of conjunctive queries. However, PICSEL differs from IM in that mappings follow a rather simplified GAV approach. More precisely, each data source consists of a set of relations, and for each data source there exists a one-to-one mapping from each of its relations to a distinct element of the global schema. In addition, PICSEL takes into account a set of constraints about the content of the sources that are expressed as CARIN assertions. Query expansion in CARIN is then used as the core algorithmic tool for query answering in PICSEL. Thus, query answering in PICSEL is quite efficient, since it is reduced to the evaluation, over the set of data sources, of a union of conjunctive queries resulting from the query expansion, which is by itself exponential in the size of the global schema. The main differences with respect to our investigation are as follows. PICSEL does not consider at all the case where the DIS specification is inconsistent. Also, it does not attempt to distinguish between data and objects. Finally, PICSEL mappings are much more restricted than the ones we consider.

Grammar AIG

The Grammar AIG [18] is a formalism for specifying how to integrate and publish SQL data coming from autonomous sources into an XML document that conforms to a DTD and satisfies a set of integrity constraints very close to the ones we also consider. Thus, an AIG evaluation produces a materialized view conforming to a quite expressive global schema. More precisely, an AIG consists of two parts: a grammar and a set of XML constraints. The grammar extends a DTD by associating semantic attributes and semantic rules with element types. The semantic attributes
are used to pass data and control during AIG evaluation. The semantic rules compute the values of the attributes by extracting data from databases via multi-source SQL queries that constitute the mappings. As a result, the XML document is constructed via a controlled derivation from the grammar and constraints, and is thus guaranteed both to conform to the DTD and to satisfy the constraints. The focus of [18] is on constraint checking, in the sense that whenever, during the generation of the document, an attribute does not satisfy a constraint, the compilation of the materialized instance is aborted.
XPeranto and SilkRoute

Both XPeranto [85] and SilkRoute [43] are XML publishing systems that support the definition of XML materialized views of SQL data. Moreover, they both support query answering over such XML views by using an intermediate representation of the views. On the one hand, XPeranto uses an XML Query Graph Model (XQGM) to represent a view. The XQGM is analogous to a physical execution plan produced by a query optimizer. Nodes in the XQGM represent operations in an algebra (e.g., select, join, unnest, union) and edges represent the dataflow from one operation to the next. Individual operations may invoke “XML-aware” procedures for constructing and deconstructing XML values, which gives XPeranto a procedural flavor. This captures well the relationship between XQuery expressions and complex SQL expressions, but it may produce an XQGM that cannot be composed with another XQuery query, and thus may not support arbitrary query answering. On the contrary, SilkRoute uses a view forest as the intermediate abstract representation of views expressed by means of XQuery, which is entirely declarative and can thus be composed with any XQuery query. As a consequence, the two representations are somehow symbiotic: declarative view forests are appropriate for “front end” query composition, whereas the procedural XQGM may be better for “back end” SQL generation.
2.3.3 GLAV approach
XML data exchange basic theoretical issues

In the same spirit as our work is the study presented in [12], where the authors start looking into the basic properties of XML data exchange with a DTD as the target schema. Specifically, they define XML data exchange settings in which source-to-target dependencies refer to the hierarchical structure of the data. They investigate the consistency problem, which, in the case of data exchange, is the problem of deciding whether there exists an instance of the target schema satisfying both the source-to-target dependencies and the DTD, and they determine its exact complexity. Moreover, they identify data exchange settings over which query answering over the target schema is tractable, and those over which it is coNP-complete, depending on the classes of regular expressions used in DTDs. Finally, for all tractable cases they provide PTIME algorithms that compute target XML documents over which queries can be answered.
Constraint-based XML rewriting

The paper [90] proposes a query answering algorithm over an XML-based DIS whose global schema is characterized by a set of expressive, though rather complicated, constraints, called nested equality-generating dependencies (NEGDs), which include functional dependencies such as XML keys, foreign keys, and more general constraints stating that certain tuples/elements in the target must satisfy certain equalities. The mappings are sound and are expressed by means of the mapping language proposed in Clio [82], which means that they follow the GLAV approach. The main problem studied in [90] is query rewriting. Thus, according to the distinction discussed in [33], even though related, such a study tackles a different issue from the one we address, since we do not aim at finding a query rewriting. Moreover, [90] does not deal with the detection (nor the resolution) of conflicts that may arise due to target constraints.
Part II
Ontology-based DIS
In this part of the thesis, we investigate ontology-based DIS. These are data integration systems whose global schema is described as the intensional level of an ontology, i.e., the shared conceptualization of a domain of interest. We are interested, in particular, in ontologies expressed by means of logic-based languages, specifically Description Logics (DLs) [14]. Indeed, OWL,^3 the current main standard language for ontology descriptions, is based on such formalisms. In a nutshell, DLs have been developed and tailored over the years in Artificial Intelligence and Computational Logic to formally represent knowledge about a domain of interest in terms of concepts (or classes), which denote sets of objects, and roles (or relations), which denote binary relations between objects. DL knowledge bases are formed by two distinct parts: the so-called TBox, which contains the intensional description of the domain of interest, and the so-called ABox, which contains extensional information. When DLs are used to express ontologies [16], the TBox is used to express the intensional level of the ontology, while the ABox is used to represent the instance level of the ontology, i.e., the information on actual objects that are instances of the concepts and roles defined at the intensional level. From a formal point of view, a DL knowledge base is a pair K = ⟨T, A⟩, where:

❒ T, the TBox, is formed by a finite set of universal assertions. The precise form of such assertions depends on the specific DL. However, we stress that the TBox mainly places constraints on the extensions of the primitive concepts and roles used to describe the domain of interest.^4

❒ A, the ABox, is formed by a finite set of membership assertions stating that a given object (or pair of objects) is an instance of a concept (or a role).

When we talk about ontology-based DIS, the extensional level of the ontology is not represented as an ABox anymore; rather, it is provided by a set of existing data sources together with a set of mappings expressing the relationship between the concepts and roles of the intensional level of the ontology, i.e., the global schema, and the data managed by a relational DBMS.

To understand which DL would be suited to act as the formalism for representing the global schema of ontology-based DIS, we clearly need to build on the results of recent research in DLs. In particular, the results of [30, 76, 57] showed that none of the variants of OWL is suitable, in that they are all coNP-hard w.r.t. data complexity.
^3 OWL Web Ontology Language Overview, http://www.w3.org/TR/owl-features/
^4 This contrasts with TBoxes, sometimes called acyclic, which consist of a finite set of definition assertions used to introduce defined concepts, i.e., abbreviations for complex combinations of primitive concepts and roles, such that a defined concept cannot refer to the concept itself.
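Before turning to the choice of a specific DL, here is a small hedged illustration of the TBox/ABox distinction just introduced; the concept, role and individual names are invented and not taken from the thesis.

    TBox T (intensional level):   Professor ⊑ Person,   ∃teaches ⊑ Professor
    ABox A (extensional level):   Professor(serge),   teaches(serge, db101)

The TBox states that every professor is a person and that only professors teach; the ABox states that the individual serge is a professor and teaches db101. In an ontology-based DIS, assertions like those of A are not stored explicitly, but are induced from the data sources through the mappings.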
Possible restrictions that guarantee polynomial reasoning (at least if we concentrate on instance checking only) have also been looked at, such as Horn-SHIQ [57], EL++ [13], and DLP [50]. Among such fragments, we choose here to focus on those belonging to the DL-Lite family [29, 30], since these allow for answering (unions of) conjunctive queries (i.e., SQL select-project-join queries) in LOGSPACE w.r.t. data complexity. More importantly, they allow for delegating query processing, after a preprocessing phase which is independent of the data, to the relational DBMS managing the data layer, i.e., the ABox. This last property is obviously crucial in ontology-based DIS, where relational data sources provide the extensional level of the ontology.

In the investigation of ontology-based DIS, we are also interested in write-also DIS, i.e., data integration systems that allow the user to perform updates over the extensional level of an ontology, i.e., the data sources. DIS updates in this context are related to the need to change an ontology in order to reflect a change in the domain of interest the ontology is supposed to represent. Generally speaking, an update is represented by a formula that is intended to sanction a set of properties that are true in the state resulting from the change. One of the major challenges when dealing with an update is how to react to the case where the update is inconsistent with the current knowledge. Clearly, in order to study updates over an ontology-based DIS, we need to build on results on DL ontology updates. However, despite the importance of update, this issue is largely unexplored. Notable exceptions are [52, 69]. In particular, in [69] the authors propose a formal semantics for updates in DLs, and present interesting results on various aspects related to computing updates. However, since the problem is addressed under the assumption that the knowledge base is specified only at the extensional (i.e., instance) level, the paper does not take into account the impact of the intensional level on ontology update. Thus, as a first step toward write-also ontology-based DIS, we present here the first results of a systematic investigation of the notion of update of ontologies expressed as DL knowledge bases, where the intensional level of the ontology is assumed to be invariant, i.e., it does not change while the KB is used,^5 while the instance level of the ontology describes the state of affairs regarding the instances of concepts, which can indeed change as the information in it is updated.

The main contributions of this part of the thesis are as follows.

❒ First, we define a new language, called DL-Lite_A, that is particularly tailored to represent ontologies in a DIS setting. In particular, DL-Lite_A allows for distinguishing between values and objects.

❒ Second, we study the main reasoning services offered by a DL-Lite_A KB. In particular, we provide algorithms to check DL-Lite_A KB satisfiability and to solve query answering over a DL-Lite_A KB. We prove that these algorithms are correct and show that they run in LOGSPACE in data complexity.

❒ Third, we propose a formal framework for DL-Lite_A ontology-based DIS. We show that in DL-Lite_A DIS, reasoning can be separated from the access to
In other words, in this paper we are not considering the so-called “ontology evolution problem”.
25 actual data sources. Then, we provide algorithms to solve DIS consistency and query answering by appropriately exploiting this nice features of DL-LiteA DIS. We prove that these algorithms are correct, and again, L OG S PACE in data complexity. ❒ Fourth, we define the notion of update of the extensional level of an ontology. Building on classical approaches on knowledge base update, we provide a general semantics for instance level update in DLs. In particular, we follow the approach of [69], and we adapt Winslett’s semantics [87, 88] to the case where the ontology is described by both a TBox and an ABox. ❒ Finally, we study update over a KB expressed in a restricted variant of DL-LiteA KB, called DL-LiteF S . We prove that DL-LiteF S is closed with respect to instance level update, in the sense that the result of an update is always expressible by a new DL-LiteF S ABox. Then, we provide an algorithm that computes the update over a DL-LiteF S KB. We prove that this algorithm is correct, and we show that it runs in polynomial time with respect to the size of the original knowledge base. To the best of our knowledge, this is the first algorithm for a well-founded approach to ontology update in DLs taking into account both the TBox and the ABox. This part of the thesis comes from an expansion and an updated version of a OWLED Workshop paper [35] and a AAAI conference paper [47]. It is organized as follows. Below, we briefly present the works that are most closely related to our. In Chapter 3, we present the DL DL-LiteA that is used to express the DIS global schema. In Chapter 4, we investigate DL-LiteA KBs satisfiability and query answering. In Chapter 5 we set up the logical framework for ontology-based data integration and provide algorithms to solve DIS consistency and query answering. Finally, in Chapter 6, we investigate instance-level updates of DL ontologies and provide an algorithm to compute an update over a DL-LiteA KB.
Chapter 3
The language

In this chapter, we present a new logic of the DL-Lite family [30], called DL-LiteA. To this aim, we start by introducing DL-LiteFRS, a new DL particularly tailored to represent ontologies. Then, we present the query language, i.e. conjunctive queries. Finally, since DL-LiteFRS, while quite interesting in general, loses the most important feature of the DLs belonging to the DL-Lite family, i.e. the ability to delegate query processing to a relational DBMS, we define DL-LiteA by imposing some restrictions on DL-LiteFRS.
3.1 DL-LiteFRS

DL-LiteFRS is a new DL, whose novel aspects w.r.t. other DLs of the DL-Lite family [30, 31] are as follows.

❒ DL-LiteFRS takes seriously the distinction between objects and values, by allowing the use of:
  – value-domains, a.k.a. concrete domains [15], denoting sets of (data) values,
  – concept attributes, denoting binary relations between objects and values, and
  – role attributes, denoting binary relations between pairs of objects and values (obviously, a role attribute can also be seen as a ternary relation relating two objects and a value).

❒ DL-LiteFRS allows one to express the existence of objects (or values) that are instances of concepts (resp. value-domains), without naming the actual objects (resp. values), by means of the so-called soft constants.

Whereas these features are essentially all provided by OWL (with the exception of role attributes, which are currently not available in OWL but are present in most conceptual modeling formalisms such as UML class diagrams and Entity-Relationship diagrams), the distinction between objects and values is typically blurred in DLs. Nevertheless, as already discussed, none of the OWL variants [77], neither OWL, nor OWL-DL, nor OWL-Lite, would be
suited to act as the formalism for representing ontologies in the context of DIS, given that, if not restricted, they all provide reasoning services that are coNP-hard in data complexity.
3.1.1  DL-LiteFRS expressions
In providing the specification of our logics, we use the following notation:

❒ A denotes an atomic concept, B a basic concept, and C a general concept;
❒ D denotes an atomic value-domain, E a basic value-domain, and F a general value-domain;
❒ P denotes an atomic role, Q a basic role, and R a general role;
❒ UC denotes an atomic concept attribute, and VC a general concept attribute;
❒ UR denotes an atomic role attribute, and VR a general role attribute;
❒ ⊤C denotes the universal concept, and ⊤D denotes the universal value-domain.

Given a concept attribute UC (resp. a role attribute UR), we call the domain of UC (resp. UR), denoted by δ(UC) (resp. δ(UR)), the set of objects (resp. of pairs of objects) that UC (resp. UR) relates to values, and we call range of UC (resp. UR), denoted by ρ(UC) (resp. ρ(UR)), the set of values that UC (resp. UR) relates to objects (resp. pairs of objects). Notice that the domain δ(UC) of a concept attribute UC is a concept, whereas the domain δ(UR) of a role attribute UR is a role. Furthermore, we denote by δF(UC) (resp. δF(UR)) the set of objects (resp. of pairs of objects) that UC (resp. UR) relates to values in the value-domain F.

In particular, DL-LiteFRS expressions are defined as follows.

❒ Concept expressions:
    B ::= A | ∃Q | δ(UC)
    C ::= ⊤C | B | ¬B | ∃Q.C | δF(UC) | ∃δF(UR) | ∃δF(UR)−

❒ Value-domain expressions (rdfDataType denotes predefined value-domains such as integers, strings, etc.):
    E ::= D | ρ(UC) | ρ(UR)
    F ::= ⊤D | E | ¬E | rdfDataType

❒ Attribute expressions:
    VC ::= UC | ¬UC
    VR ::= UR | ¬UR

❒ Role expressions:
    Q ::= P | P− | δ(UR) | δ(UR)−
    R ::= Q | ¬Q | δF(UR) | δF(UR)−
In the value-domain expressions above, rdfDataType denotes predefined value-domains, such as integers, strings, etc., that correspond to the RDF data types. Coherently with RDF, we assume that such data types are pairwise disjoint. In the following, we denote each such domain by T, possibly with subscript, i.e., we assume rdfDataType ::= T1 | . . . | Tn.

As usual in DLs, the semantics of DL-LiteFRS is given in terms of first-order logic interpretations. More precisely, an interpretation I = (∆^I, ·^I) consists of:

❒ a first-order structure over the interpretation domain ∆^I, which is the disjoint union of two domains:
  – ∆_O^I, called the interpretation domain of objects, and
  – ∆_V^I, called the interpretation domain of (data) values;

❒ an interpretation function ·^I such that (i) for each rdfDataType Ti, it holds that Ti^I ⊆ ∆_V^I, and for each pair of rdfDataTypes Ti, Tj with i ≠ j, it holds that Ti^I ∩ Tj^I = ∅; and (ii) the following conditions are satisfied:

    ⊤C^I = ∆_O^I
    ⊤D^I = ∆_V^I
    A^I ⊆ ∆_O^I
    D^I ⊆ ∆_V^I
    P^I ⊆ ∆_O^I × ∆_O^I
    UC^I ⊆ ∆_O^I × ∆_V^I
    UR^I ⊆ ∆_O^I × ∆_O^I × ∆_V^I
    (¬B)^I = ∆_O^I \ B^I
    (¬E)^I = ∆_V^I \ E^I
    (¬Q)^I = (∆_O^I × ∆_O^I) \ Q^I
    (¬UC)^I = (∆_O^I × ∆_V^I) \ UC^I
    (¬UR)^I = (∆_O^I × ∆_O^I × ∆_V^I) \ UR^I
    (P−)^I = { (o, o′) | (o′, o) ∈ P^I }
    (ρ(UC))^I = { v | ∃o. (o, v) ∈ UC^I }
    (ρ(UR))^I = { v | ∃o, o′. (o, o′, v) ∈ UR^I }
    (δF(UC))^I = { o | ∃v. (o, v) ∈ UC^I ∧ v ∈ F^I }
    (δ(UC))^I = (δ⊤D(UC))^I
    (δF(UR))^I = { (o, o′) | ∃v. (o, o′, v) ∈ UR^I ∧ v ∈ F^I }
    (δ(UR))^I = (δ⊤D(UR))^I
    (δF(UR)−)^I = { (o, o′) | ∃v. (o′, o, v) ∈ UR^I ∧ v ∈ F^I }
    (δ(UR)−)^I = (δ⊤D(UR)−)^I
    (∃δF(UR))^I = { o | ∃o′. (o, o′) ∈ (δF(UR))^I }
    (∃δF(UR)−)^I = { o | ∃o′. (o, o′) ∈ (δF(UR)−)^I }
    (∃Q)^I = { o | ∃o′. (o, o′) ∈ Q^I }
    (∃Q.C)^I = { o | ∃o′. (o, o′) ∈ Q^I ∧ o′ ∈ C^I }
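As a minimal sketch, assuming ad-hoc Python dataclasses named Atomic, Inverse, Exists and Domain (names introduced only for this illustration), a few of the expressions above can be written down as follows.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Atomic:            # an atomic concept, role, value-domain or attribute name
        name: str

    @dataclass(frozen=True)
    class Inverse:           # an inverse role P−
        role: Atomic

    @dataclass(frozen=True)
    class Exists:            # ∃Q (qualifier is None) or ∃Q.C
        role: object
        qualifier: object = None

    @dataclass(frozen=True)
    class Domain:            # δ(U) (restriction is None) or δ_F(U)
        attribute: Atomic
        restriction: object = None

    # ∃WORKS-FOR.project and δ_xsd:date(until), written with the classes above:
    works_for_project = Exists(Atomic("WORKS-FOR"), Atomic("project"))
    until_date = Domain(Atomic("until"), Atomic("xsd:date"))
    print(works_for_project)
    print(until_date)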
3.1.2  DL-LiteFRS TBox
DL-LiteFRS TBox assertions are of the form:

    B ⊑ C        (concept inclusion assertion)
    Q ⊑ R        (role inclusion assertion)
    E ⊑ F        (value-domain inclusion assertion)
    UC ⊑ VC      (concept attribute inclusion assertion)
    UR ⊑ VR      (role attribute inclusion assertion)

    (funct P)    (role functionality assertion)
    (funct P−)   (inverse role functionality assertion)
    (funct UC)   (concept attribute functionality assertion)
    (funct UR)   (role attribute functionality assertion)
Resource Description Framework (RDF), http://www.w3.org/RDF/
A concept inclusion assertion expresses that a (basic) concept B is subsumed by a (general) concept C; analogously for the other types of inclusion assertions. A role functionality assertion expresses the (global) functionality of an atomic role; analogously for the other types of functionality assertions. Note that in the sequel we will sometimes consider a TBox T as the disjoint union of Tp, Tk and Tni, where:

❒ Tp is the set of all inclusion assertions (of any type), called Positive Inclusion assertions (PIs), having a positive expression in the right-hand side;
❒ Tni is the set of all inclusion assertions (of any type), called Negative Inclusion assertions (NIs), having a negated expression in the right-hand side;
❒ Tk is the set of all functionality assertions (of any type).

We now give the semantics of a TBox T, again in terms of interpretations I = (∆^I, ·^I) over the domain ∆^I. An interpretation I = (∆^I, ·^I) is a model of a DL-LiteFRS TBox T, written I ∈ Mod(T), or equivalently, I satisfies T, written I |= T, if I satisfies each assertion α in T. More precisely:
❒ if α is an inclusion assertion β1 ⊑ β2, where β1 and β2 may denote concepts, roles, value-domains, concept attributes, or role attributes, we must have:

    β1^I ⊆ β2^I

❒ if α is a role functionality assertion (funct Q), where Q is either P or P−, we must have, for each o1, o2, o3:

    (o1, o2) ∈ Q^I ∧ (o1, o3) ∈ Q^I ⇒ o2 = o3

❒ if α is a concept attribute functionality assertion (funct UC), we must have, for each o, v1, v2:

    (o, v1) ∈ UC^I ∧ (o, v2) ∈ UC^I ⇒ v1 = v2

❒ if α is a role attribute functionality assertion (funct UR), we must have, for each o1, o2, v1, v2:

    (o1, o2, v1) ∈ UR^I ∧ (o1, o2, v2) ∈ UR^I ⇒ v1 = v2

where each o, possibly with subscript, is an element of ∆_O^I, whereas each v, possibly with subscript, is an element of ∆_V^I.

We next give an example of a DL-LiteFRS TBox, with the aim of highlighting the use of attributes (in particular, role attributes). Note that in all the following examples, concept names are written in lowercase, role names are written in UPPERCASE, attribute names are in sans serif font, and domain names are in typewriter font.
Example 3.1.1 Let T be the TBox containing the following assertions:

    tempEmp ⊑ employee              (3.1)
    manager ⊑ employee              (3.2)
    employee ⊑ person               (3.3)
    employee ⊑ ∃WORKS-FOR.project   (3.4)
    person ⊑ δ(persName)            (3.5)
    ρ(persName) ⊑ xsd:string        (3.6)
    (funct persName)                (3.7)
    project ⊑ δ(projName)           (3.8)
    ρ(projName) ⊑ xsd:string        (3.9)
    (funct projName)                (3.10)
    tempEmp ⊑ ∃δ(until)             (3.11)
    δ(until) ⊑ WORKS-FOR            (3.12)
    (funct until)                   (3.13)
    ρ(until) ⊑ xsd:date             (3.14)
    (funct MANAGES)                 (3.15)
    MANAGES ⊑ WORKS-FOR             (3.16)
    manager ⊑ ¬∃δ(until)            (3.17)
The above TBox T models information about employees and projects. Specifically, the assertions in T state the following. Both managers and fixed-term employees (tempEmp) are types of employees (3.1, 3.2), an employee is a person (3.3) working for a project (3.4), and a person and a project are both always characterized by a unique name (3.5, 3.7, 3.8, 3.10). In particular, a person name and a project name may be any string (3.6, 3.9). Moreover, someone who manages a project works for that project (3.16); note, however, that an employee can manage at most one project (3.15). Finally, the until role attribute possibly associates a unique date (3.13, 3.14) with an employment (3.12). Thus, T allows us to express that a fixed-term employee works for at least one project until a fixed date (3.11), whereas a manager is someone who has only permanent positions (3.17). Note that this implies that there exists no employee who is simultaneously a fixed-term employee and a manager.
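As a minimal sketch, assuming a plain Python encoding in which inclusion assertions are pairs of strings and functionality assertions are a set of names (the spellings EXISTS, delta(...) and rho(...) are assumptions made only for illustration), the TBox T of Example 3.1.1 can be written down as follows; later examples refer back to this encoding.

    # The TBox of Example 3.1.1 as plain Python data (illustrative encoding).
    TBOX = {
        "positive_inclusions": [
            ("tempEmp", "employee"),                  # (3.1)
            ("manager", "employee"),                  # (3.2)
            ("employee", "person"),                   # (3.3)
            ("employee", "EXISTS WORKS-FOR.project"), # (3.4)
            ("person", "delta(persName)"),            # (3.5)
            ("rho(persName)", "xsd:string"),          # (3.6)
            ("project", "delta(projName)"),           # (3.8)
            ("rho(projName)", "xsd:string"),          # (3.9)
            ("tempEmp", "EXISTS delta(until)"),       # (3.11)
            ("delta(until)", "WORKS-FOR"),            # (3.12)
            ("rho(until)", "xsd:date"),               # (3.14)
            ("MANAGES", "WORKS-FOR"),                 # (3.16)
        ],
        "negative_inclusions": [
            ("manager", "NOT EXISTS delta(until)"),   # (3.17)
        ],
        # (3.7), (3.10), (3.13), (3.15)
        "functionality": ["persName", "projName", "until", "MANAGES"],
    }
    print(len(TBOX["positive_inclusions"]), "positive inclusions")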
3.1.3  DL-LiteFRS ABox
We now focus on the DL-LiteFRS ABox. To this aim, we introduce an alphabet of hard constants Γ, for short constants, that is the disjoint union of two alphabets, ΓO and ΓV. Symbols in ΓO, called object identifiers (or also object constants), are used to denote objects, while symbols in ΓV, called value constants, are used to denote data values. Moreover, we introduce an alphabet of soft constants V. Coherently with Γ, V is the disjoint union of two sets, VO and VV, of soft object constants and soft value constants, respectively. A DL-LiteFRS ABox over Γ is a finite set of
assertions, called membership assertions, of the form:

    C(a),   C(so),   F(d),   F(sv),   R(a, b),   VC(a, d),   VR(a, b, d)
where a and b are constants in ΓO, so and sv are soft constants in VO and VV respectively, and d is a constant in ΓV. An assertion involving only constants is called ground.

Let us focus on soft constants. Soft constants are used to express the existence of objects (or values) that are instances of concepts (resp. value-domains), without actually naming their object ids (resp. value constants). In other words, soft constants are constants for which the unique name assumption does not hold. It is worth noting that, according to the syntax above, soft constants can occur inside concepts or value-domains, whereas they cannot occur inside roles. In spite of this restriction, the following example shows that soft constants actually add expressive power (which will also become clearer when discussing updates in Chapter 6).

Example 3.1.2 Consider the following two ABoxes:

❒ A1 = {A(a), B(b)} (b a constant), and
❒ A2 = {A(a), B(x)} (x a soft constant).

They do not have the same set of models. Indeed, for each interpretation I1 = (∆^I1, ·^I1) that is a model of A1, A and B are interpreted as two sets of objects which contain, respectively, at least one object oA and one object oB of the domain, such that ·^I1(a) = oA and ·^I1(b) = oB, where oA ≠ oB. Clearly, each such model I1 is also a model of A2. Now let I2 = (∆^I2, ·^I2) be an interpretation such that A and B contain only one and the same object o of the domain. Then I2 is a model of A2, with an assignment µ such that µ(x) = o, where ·^I2(a) = o. On the contrary, I2 is not a model of A1, with any assignment.

From the above example it follows that if we were able to express for which constants the unique name assumption holds, then soft constants would not add expressive power. However, from a technical point of view, by following such an approach we would have to change (and complicate) all the definitions we give in Chapter 6 that lead to the definition of update (e.g. the difference among interpretations).

In order to give the semantics of a DL-LiteFRS ABox in terms of interpretations I = (∆^I, ·^I), since the ABox may involve soft constants, whereas ·^I is a function from the set of constants Γ to the domain ∆^I, we need to introduce the preliminary notion of assignment.

Definition 3.1.3 Let V be the disjoint union of the sets of soft constants VO and VV, and ∆^I the disjoint union of ∆_O^I and ∆_V^I. Given an ABox A, we call assignment for A a function µ from V to ∆^I such that:

❒ for each so ∈ VO occurring in A, µ(so) = o ∈ ∆_O^I;
❒ for each sv ∈ VV occurring in A, µ(sv) = v ∈ ∆_V^I.
From a technical point of view, the reason for this restriction is that the presence of soft constants in roles would make reasoning much less efficient, since it would possibly require recursively unifying soft constants according to role functionality assertions.
Let ∆^I be the disjoint union of ∆_O^I and ∆_V^I, and I = (∆^I, ·^I) an interpretation. Moreover, let µ be an assignment for A. We say that I is a model of A with µ, or equivalently, I satisfies A with µ, written I |= A[µ], if the following conditions are satisfied.

❒ First, I = (∆^I, ·^I) assigns to each constant in ΓO, ΓV a distinct element of ∆_O^I, ∆_V^I respectively, as follows:
  – for all a ∈ ΓO, we have that a^I ∈ ∆_O^I;
  – for all a, b ∈ ΓO, we have that a ≠ b implies a^I ≠ b^I;
  – for all d ∈ ΓV, we have that d^I ∈ ∆_V^I;
  – for all d, e ∈ ΓV, we have that d ≠ e implies d^I ≠ e^I.

❒ Second, I satisfies each membership assertion α in A with µ, written I |= α[µ]. More precisely, for each membership assertion α ∈ A, we have that:
  – if α = C(a), with a ∈ ΓO, then a^I ∈ C^I;
  – if α = C(so), with so ∈ VO, then µ(so) ∈ C^I;
  – if α = F(d), with d ∈ ΓV, then d^I ∈ F^I;
  – if α = F(sv), with sv ∈ VV, then µ(sv) ∈ F^I;
  – if α = R(b1, b2), with b1, b2 ∈ ΓO, then (b1^I, b2^I) ∈ R^I;
  – if α = VC(b, d), with b ∈ ΓO and d ∈ ΓV, then (b^I, d^I) ∈ VC^I;
  – if α = VR(b1, b2, d), with b1, b2 ∈ ΓO and d ∈ ΓV, then (b1^I, b2^I, d^I) ∈ VR^I.

Finally, we say that I is a model of A if there exists an assignment µ for A such that I is a model of A with µ. Thus we define the set Mod(A) of models of A as follows:

    Mod(A) = {I | ∃µ. I |= A[µ]}.

We now give an example of an ABox. Note that in all examples that follow, object constants in ΓO are written in boldface font, whereas value constants in ΓV are written in slanted font.

Example 3.1.4 Consider the following ABox A, where z ∈ VO:

    tempEmp(z)                          (3.18)
    until(z, DIS-1212, 25-09-05)        (3.19)
    projName(DIS-1212, QuOnto)          (3.20)
    manager(Lenz)                       (3.21)
Specifically, the ABox assertions in A state that there exists an object denoting a fixed-term employee ( 3.18). Moreover, the name of DIS-1212 is QuOnto, and the object identified by Lenz is a manager.
3.1.4  DL-LiteFRS knowledge base
Now that we have introduced DL-LiteFRS TBoxes and ABoxes, we are finally able to define when an interpretation is a model of a DL-LiteFRS KB K. Let µ be an assignment for A. An interpretation I is a model of a KB K = ⟨T, A⟩ with µ, written I |= K[µ], if I is a model of T and of A with µ. A KB is satisfiable if it has at least one model, i.e. if there exist at least one interpretation I and one assignment µ for A such that I is a model of K with µ. Thus, we have that:

    Mod(K) = {I | ∃µ. I |= A[µ] ∧ I |= T}.
Given a ground DL-LiteFRS assertion α, we say that a KB K logically implies α, written K |= α, if for each model I of K, we have that I is a model of α.

Example 3.1.5 Let K = ⟨T, A⟩ be the knowledge base whose TBox T is the one of Example 3.1.1, and whose ABox A is the one of Example 3.1.4. Clearly, K is satisfiable. Indeed, a possible model I for K is described as follows. First, µ is an assignment for the soft constant in A such that µ(z) = Palm, where Palm denotes a fixed-term employee. Lenz denotes a manager, and as such, Lenz manages exactly one project. In particular, in I, Lenz manages the project identified by DIS-1212 and named "QuOnto", for which he works permanently. Moreover, Lenz works permanently for the project denoted by the object id FP6-7603. However, Lenz does not manage FP6-7603, since otherwise I would violate the functionality assertion (3.15) of T. On the other hand, another model I′ may be such that Lenz manages FP6-7603, whereas he does not manage DIS-1212. Note, finally, that there exists no model of K such that Lenz is interpreted as a fixed-term employee (and thus there exists no assignment µ such that µ(z) = Lenz), since according to (3.21) of A, Lenz is a manager and, as observed in Example 3.1.1, the sets of managers and fixed-term employees are disjoint.

Before presenting, in the next section, the query language we use for the investigation of ontology-based DIS, we next introduce a notion that will be useful in the sequel.

Definition 3.1.6 Let K = ⟨T, A⟩ be a DL-LiteFRS KB and I an interpretation for K. Then we call most general assignment for A w.r.t. I an assignment µ0 for A that satisfies the following conditions:

❒ for each C(so) ∈ A, with so ∈ VO, µ0(so) = on, where on is a fresh object in ∆_O^I, and
❒ for each F(sv) ∈ A, with sv ∈ VV, µ0(sv) = vn, where vn is a fresh value in ∆_V^I,

where we say that µ0(s) is a fresh object (or value) if µ0(s) denotes an object (resp. a value) such that, for each constant c and each soft constant s′ ≠ s occurring in A, c^I ≠ µ0(s) and µ0(s′) ≠ µ0(s).
Note that we are not interested here in the logical implication of formulas that are not ground, even though, clearly, such a notion may easily be obtained by an obvious generalization of the notion of logical implication of ground formulas.
Intuitively, a most general assignment is an assignment ensuring that soft constant names are treated as individual constant names. It is straightforward to prove the following.

Proposition 3.1.7 Let K = ⟨T, A⟩ be a DL-LiteFRS KB. Then, for each pair of most general assignments µ0 and µ0′ for A w.r.t. I, with µ0 ≠ µ0′, we have that I |= K[µ0] ⇐⇒ I |= K[µ0′].

Moreover, most general assignments have the following interesting property.

Proposition 3.1.8 Let K = ⟨T, A⟩ be a DL-LiteFRS KB. Then,

    K is satisfiable ⇐⇒ ∃I, µ0 such that I |= K[µ0],

where µ0 is a most general assignment for A w.r.t. I.
Proof. “⇐”: Trivial (by definition). “⇒”: Suppose that K is satisfiable. Then, there exist an assignment µ for A and an interpretation J such that J |= K[µ]. Suppose now, by contradiction, that there exists no interpretation I that is a model of K with some most general assignment for A w.r.t. I. Then, in particular, µ is not a most general assignment for A w.r.t. J. Thus, let s̄ be a soft constant in A and let µ0^J be such that:

❒ µ0^J(s̄) ≠ µ(s̄), and
❒ µ0^J(s̄) ≠ µ(s), for each soft constant s in A with s ≠ s̄.

Then there exists a membership assertion C(s̄) in A such that (i) either µ(s̄) = o = a^J, for some constant a occurring in A, or (ii) µ(s̄) = o = µ(s), for some soft constant s ≠ s̄. Since J ⊭ K[µ0^J], in both cases there must exist at least one assertion α in K such that J |= α[µ] and J ⊭ α[µ0^J]. But then α must involve s̄, since:

❒ if α does not involve any soft constant, then, clearly, either α is satisfied by J with both µ and µ0^J, or α is not satisfied by J with either of the two assignments;
❒ if α involves a soft constant y ≠ s̄, by hypothesis µ0^J(y) = µ(y), and thus α is satisfied by J with both assignments µ0^J and µ.

Therefore, α must be a membership assertion of the form C′(s̄), and since J ⊭ C′(s̄)[µ0^J], we have that µ0^J(s̄) ∉ C′^J. But since µ0^J is a most general assignment for A w.r.t. J, µ0^J(s̄) is a fresh object. Thus, it is always possible to build an assignment µ̄0^J that is identical to µ0^J except for the fact that µ̄0^J(s̄) ∈ C′^J. Clearly, µ̄0^J is a most general assignment for A w.r.t. J. Moreover, J is a model of K with µ̄0^J. Thus, we obtain a contradiction.

Intuitively, the above proposition shows that in order to study DL-LiteA KB satisfiability, we can essentially abstract from the presence of soft constants, by considering them as distinct hard constants.
3.2 Query language

A conjunctive query (CQ) q over a DL-LiteFRS ontology is an expression of the form q(~x) ← conj(~x, ~y), where ~x is a tuple of distinct variables, the so-called distinguished variables, ~y is a tuple of distinct existentially quantified variables (not occurring in ~x), called the non-distinguished variables, and conj(~x, ~y) is a conjunction of atoms of the form A(xo), P(xo, yo), D(xv), UC(xo, xv), UR(xo, yo, xv), xo = yo, or xv = yv, where:

❒ A, P, D, UC, and UR are, respectively, an atomic concept, an atomic role, an atomic value-domain, an atomic concept attribute, and an atomic role attribute in T;
❒ xo, yo are either variables in ~x and ~y, called object variables, or constants in ΓO;
❒ xv, yv are either variables in ~x and ~y, called value variables, or constants in ΓV.

We say that q(~x) is the head of the query, whereas conj(~x, ~y) is the body. Moreover, the arity of q is the arity of ~x. Finally, a union of conjunctive queries (UCQ) is a query of the form:

    Q(~x) ← ⋃_i conj_i(~x, ~y_i).
Given an interpretation I = (∆^I, ·^I), the query Q(~x) ← ϕ(~x, ~y) (either a conjunctive query or a union of conjunctive queries) is interpreted in I as the set Q^I of tuples ~o_x ∈ ∆^I × · · · × ∆^I such that there exists ~o_y ∈ ∆^I × · · · × ∆^I such that, if we assign to the tuple of variables (~x, ~y) the tuple (~o_x, ~o_y), the formula ϕ(~o_x, ~o_y) is true in I [5]. Then, given a tuple ~t of elements of Γ (we recall that Γ is the disjoint union of the object and value constants ΓO and ΓV), we say that ~t is a certain answer to Q over K, written ~t ∈ ans(Q, K), if for each interpretation I that is a model of K, we have that ~t^I ∈ Q^I. Thus, as for DL-LiteFRS assertions, we say that K logically implies Q(~t), written K |= Q(~t), where Q(~t) is obtained from Q(~x) by substituting ~x with ~t.

Example 3.2.1 Let K be the knowledge base introduced in Example 3.1.5. Suppose first that we pose the following query, asking for all employees:

    q(x) ← employee(x).

One can verify that the set of certain answers is {Lenz}. Indeed, Lenz is the only object id denoting an employee in all possible models, with any assignment. Suppose now that we ask for all pairs participating in the role WORKS-FOR:

    q(x, y) ← WORKS-FOR(x, y).

We then obtain no answer, since there exists no pair of object ids (a, b) in Γ such that (a^I, b^I) ∈ q^I for all models I of K.
Proposition 3.2.2 Let K = ⟨T, A⟩ be a satisfiable DL-LiteFRS KB, and Q a union of conjunctive queries over K of arity n. Moreover, let m be the number of distinct soft constants sj occurring in A. Then,

    ans(Q, K) = {~t = (t1, · · · , tn) | ∀I, ∃µ0^I, I |= K[µ0^I] ⇒ (~t^I ∈ Q^I ∧ ∀i ∈ {1, · · · , n}, ∀j ∈ {1, · · · , m}, ti^I ≠ µ0^I(sj))}

where µ0^I denotes a most general assignment for A w.r.t. I.
Proof. In order to prove the claim, we denote by R0 the set

    R0 = {~t = (t1, · · · , tn) | ∀I, ∃µ0^I, I |= K[µ0^I] ⇒ (~t^I ∈ Q^I ∧ ∀i ∈ {1, · · · , n}, ∀j ∈ {1, · · · , m}, ti^I ≠ µ0^I(sj))}

and then we show that R0 ⊇ ans(Q, K) and R0 ⊆ ans(Q, K).

⊇: Trivial, by Proposition 3.1.7.

⊆: Let I be a model of K and let ~t ∈ Γ^n be a tuple of constants such that ~t ∈ ans(Q, K). Then, in particular, I |= Q(~t). From Proposition 3.1.8, since K is satisfiable, there exists a most general assignment µ0^I for A w.r.t. I such that I |= K[µ0^I]. Let us now show that ti^I ≠ µ0^I(sj) for each i ∈ {1, · · · , n} and j ∈ {1, · · · , m}. To this aim, suppose by contradiction that ti^I = µ0^I(sj), for some i, j such that sj is a soft constant occurring in a membership assertion X(sj), where X may denote either a concept or a value-domain. Then we can define a most general assignment µ0′ that is identical to µ0^I except for the assignment of sj, i.e. µ0′(sj) ≠ µ0^I(sj). Since by definition µ0^I(sj) is an arbitrary fresh constant, we can construct a model I′ by modifying I so that (i) µ0^I(sj) ∉ X^I′, and (ii) µ0′(sj) ∈ X^I′. Then, clearly, I′ is a model of K with µ0′. Moreover, I′ ⊭ Q(~t), thus contradicting the hypothesis that ~t ∈ ans(Q, K).

Note that the above proposition plays the same role for the query answering problem that Proposition 3.1.8 plays for KB satisfiability. Indeed, it shows that, given a query Q, in order to compute all certain answers to Q over a KB it is sufficient to consider only most general assignments for K. Thus, in particular, this allows us to compute the certain answers to Q over a KB by first considering each soft constant as a distinct hard constant, and finally eliminating those tuples that contain these newly introduced constants.
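As a sketch of the recipe just described, assuming the ABox of Example 3.1.4 stored in relational tables (with soft constants kept as marked fresh constants) and a hand-written union reflecting the inclusions tempEmp ⊑ employee and manager ⊑ employee of Example 3.1.1, the certain answers to q(x) ← employee(x) of Example 3.2.1 can be computed as follows; the systematic query rewriting is developed only in Chapter 4, so the union below is written by hand purely for illustration, with sqlite3 standing in for the relational DBMS.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE employee(c TEXT);
    CREATE TABLE tempEmp(c TEXT);
    CREATE TABLE manager(c TEXT);
    CREATE TABLE Fresh(c TEXT);          -- soft constants stored as marked fresh constants

    INSERT INTO tempEmp VALUES ('z');    -- tempEmp(z), z a soft constant   (3.18)
    INSERT INTO Fresh   VALUES ('z');
    INSERT INTO manager VALUES ('Lenz'); -- manager(Lenz)                   (3.21)
    """)

    # q(x) <- employee(x), expanded by hand over the subconcepts of employee,
    # then filtered so that no answer contains a fresh (soft) constant.
    rows = db.execute("""
    SELECT c FROM employee
    UNION SELECT c FROM tempEmp
    UNION SELECT c FROM manager
    EXCEPT SELECT c FROM Fresh
    """).fetchall()

    print(rows)   # [('Lenz',)] -- matching the certain answers of Example 3.2.1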
3.3 DL-LiteA

Let us now compare the main features of DL-LiteFRS with those of other DLs in the DL-Lite family.

❒ First, the DL-LiteFRS ABox allows for membership assertions involving general concepts and roles (as well as general value-domains, concept attributes, and role attributes).
❒ Second, DL-LiteFRS allows for the representation of:
  – the universal concept ⊤C (and the universal value-domain ⊤D);
  – qualified existential quantifications, i.e. expressions of the form ∃R.C, δF(UC), ∃δF(UR), ∃δF(UR)−, δF(UR), δF(UR)−.

❒ Third, DL-LiteFRS combines the main features of DL-LiteF and DL-LiteR, since it allows functionality restrictions on roles, mandatory participation in roles, and disjointness between roles.

❒ Fourth, DL-LiteFRS distinguishes between objects and values, and to this end it introduces, besides concepts and roles, also value-domains, concept attributes and role attributes. None of the other DLs in the DL-Lite family (nor any other DL we are aware of) allows for such a distinction.

❒ Fifth, the DL-LiteFRS ABox allows for the occurrence of soft constants in membership assertions involving concepts (and value-domains).

Next, we show that we can reduce general DL-LiteFRS KBs to DL-LiteFRS KBs that are equivalent, in terms of query answering, and have a much simpler form, called basic. Such a form recalls that of the other DLs in the DL-Lite family, since it does not exploit the last two features of DL-LiteFRS mentioned above. To show this, we start by defining basic DL-LiteFRS KBs.

Definition 3.3.1 Let K = ⟨T, A⟩ be a DL-LiteFRS KB. We say that K is a basic DL-LiteFRS KB if it is such that:

❒ the right-hand side of each concept inclusion assertion in T has the form B | ¬B, where B denotes a basic concept;
❒ the right-hand side of each role inclusion assertion in T has the form Q | ¬Q, where Q denotes, as usual, a basic role;
❒ all membership assertions in A involve only atomic concepts, atomic value-domains, atomic concept attributes, atomic role attributes, and atomic roles.

Example 3.3.2 One can easily verify that the KB K = ⟨T, A⟩ such that T is the TBox of Ex. 3.1.1 and A is the ABox of Ex. 3.1.4 is a basic DL-LiteFRS KB.

We now show that it is possible to convert a general DL-LiteFRS KB into a basic DL-LiteFRS KB that is equivalent to the initial KB from the point of view of KB satisfiability and query answering, and therefore from the point of view of the main reasoning services. Intuitively, this can be done by compiling away all qualified existential
quantifications in the right-hand side of both concept and role inclusion assertions, by rewriting them through the use of auxiliary roles. Similarly, all membership assertions involving complex expressions can be compiled away by rewriting them through the use of auxiliary expressions. Specifically, given a DL-LiteFRS KB K, we denote by Conv(K) a KB that is obtained from K by replacing each assertion α involving an expression Y with a set of assertions S(α), according to the rules shown in Fig. 3.1, where the newly introduced auxiliary expressions (denoted with the subscript aux) may be concepts, value-domains, concept attributes, role attributes, or roles. Then, we have the following.

Lemma 3.3.3 Let K be a DL-LiteFRS KB. Then, we have that:

1. K is satisfiable if and only if Conv(K) is satisfiable;
2. for each conjunctive query q (not involving the newly introduced auxiliary expressions), and for each tuple ~t of elements of ΓV ∪ ΓO, ~t ∈ ans(q, K) if and only if ~t ∈ ans(q, Conv(K)).
Proof. “⇐”:
1. Suppose that Conv(K) is satisfiable. Moreover, suppose that Conv(K) is obtained from K by replacing an assertion α with a set of assertions S(α) according to Fig. 3.1. One can easily verify that S(α) |= α, and hence Conv(K) |= α. Moreover, by construction, we have that Conv(K) ⊇ K \ {α}. Thus, since each model of Conv(K) is also a model of α, each model of Conv(K) is also a model of K, proving that K is satisfiable.

2. If Conv(K) is not satisfiable, then the claim trivially holds. Thus, let us suppose that Conv(K) is satisfiable. Moreover, let q be a conjunctive query and ~t be a tuple of constants such that ~t ∈ ans(q, Conv(K)), i.e. Conv(K) |= q(~t). We want to show that ~t ∈ ans(q, K), i.e. K |= q(~t). Since we showed previously that each model of Conv(K) is also a model of K, from Conv(K) |= q(~t) it follows that K |= q(~t).
“⇒”:
1. Suppose that K is satisfiable and, by contradiction, that Conv(K) is not satisfiable. Moreover, suppose that Conv(K) is obtained from K by replacing an assertion α with a set of assertions S(α) according to Fig. 3.1. Say, for instance, that α = B ⊑ ∃R.C ∈ K. Then, α does not belong to Conv(K), which instead contains the following assertions:

    B ⊑ ∃Raux
    ∃Raux− ⊑ C
    Raux ⊑ R

where Raux is a new auxiliary role. Let I be a model of K with assignment µ. Note that such a model exists since K is satisfiable. Thus, we can construct an interpretation I′ by setting I′ = I and then extending I′ as follows:
Y = ∃R.C:
    X ⊑ Y   is replaced by   X ⊑ ∃Raux,  ∃Raux− ⊑ C,  Raux ⊑ R
    Y(c)    is replaced by   Yaux(c),  Yaux ⊑ ∃Raux,  ∃Raux− ⊑ C,  Raux ⊑ R

Y = δF(UC):
    X ⊑ Y   is replaced by   X ⊑ δ(UCaux),  ρ(UCaux) ⊑ F,  UCaux ⊑ UC
    Y(c)    is replaced by   Yaux(c),  Yaux ⊑ δ(UCaux),  ρ(UCaux) ⊑ F,  UCaux ⊑ UC

Y = ∃δF(UR):
    X ⊑ Y   is replaced by   X ⊑ ∃δ(URaux),  ρ(URaux) ⊑ F,  URaux ⊑ UR
    Y(c)    is replaced by   Yaux(c),  Yaux ⊑ ∃δ(URaux),  ρ(URaux) ⊑ F,  URaux ⊑ UR

Y = ∃δF(UR)−:
    X ⊑ Y   is replaced by   X ⊑ ∃δ(URaux)−,  ρ(URaux) ⊑ F,  URaux ⊑ UR
    Y(c)    is replaced by   Yaux(c),  Yaux ⊑ ∃δ(URaux)−,  ρ(URaux) ⊑ F,  URaux ⊑ UR

Y = δF(UR):
    X ⊑ Y     is replaced by   X ⊑ δ(URaux),  ρ(URaux) ⊑ F,  URaux ⊑ UR
    Y(c, d)   is replaced by   Yaux(c, d),  Yaux ⊑ δ(URaux),  ρ(URaux) ⊑ F,  URaux ⊑ UR

Y = δF(UR)−:
    X ⊑ Y     is replaced by   X ⊑ δ(URaux)−,  ρ(URaux) ⊑ F,  URaux ⊑ UR
    Y(c, d)   is replaced by   Yaux(c, d),  Yaux ⊑ δ(URaux)−,  ρ(URaux) ⊑ F,  URaux ⊑ UR

Y = ¬A | ¬D | ∃Q | δ(UC) | ρ(UC) | ρ(UR) | ⊤C | ⊤D | rdfDataType:
    Y(c)    is replaced by   Yaux(c),  Yaux ⊑ Y

Y = ¬UC | ¬Q:
    Y(c, d)   is replaced by   Yaux(c, d),  Yaux ⊑ Y

Y = ¬UR:
    Y(c, d, e)   is replaced by   Yaux(c, d, e),  Yaux ⊑ Y

Figure 3.1: Rules for computing Conv(K). Expressions with the subscript aux denote newly introduced auxiliary expressions; combinations of Y and assertion forms not listed above do not occur.
– for each (o1, o2) ∈ R^I such that o1 ∈ B^I, we set (o1, o2) ∈ Raux^I′,

where o1, o2 denote objects in ∆_O^I. Since I and I′ differ only because of the fact that (o1, o2) ∈ Raux^I′, I′ satisfies all the assertions in K. In particular, I′ satisfies the assertion B ⊑ ∃R.C. Then, it is easy to verify that I′ also satisfies the assertions above. Thus, I′ is a model of Conv(K), which contradicts Conv(K) being not satisfiable. With a similar argument, we may prove that the result holds when any other rule among those shown in Fig. 3.1 is applied.

2. Let q be a conjunctive query and ~t a tuple of constants such that K |= q(~t). Again, we can suppose that K is satisfiable, since otherwise the claim trivially holds. Moreover, we suppose again that Conv(K) is obtained from K by replacing α = B ⊑ ∃R.C ∈ K with S(α) as shown above. Then, let I be a model of K with an assignment µ, and suppose we obtain from I a model I′ of Conv(K) as shown above. Clearly, I′ |= q(~t), since I |= q(~t), I and I′ differ only because of the fact that (o1, o2) ∈ Raux^I′, and q does not involve Raux by hypothesis. Thus, to prove the claim, we need to prove that there exists no model of Conv(K) whose restriction to the expressions used in K does not satisfy an assertion in K. By contradiction, let I′′ be a model of Conv(K) not satisfying an assertion β in K. Two cases are possible:

  – either β ≠ α; but then we obtain a contradiction, since, by construction, Conv(K) is obtained from K by replacing α with S(α), and thus I′′ satisfies all assertions in K different from α;
  – or β = α; but since S(α) |= α and since I′′ satisfies S(α), we obtain again a contradiction.

Again, with a similar argument, we may prove that the result holds when any other rule among those shown in Fig. 3.1 is applied.
Proposition 3.3.4 Let K be a DL-LiteFRS KB. Then, there always exists a basic DL-LiteFRS KB K′ that is equivalent to K from the point of view of satisfiability and query answering over K. Moreover, K′ can be computed in PTIME in the size of K.
Proof. The proof is based on the following observations:

❒ for each DL-LiteFRS KB K, Conv(K) is a basic DL-LiteFRS KB;
❒ by Lemma 3.3.3, Conv(K) is equivalent to K from the point of view of satisfiability and query answering over K;
❒ by construction of Conv(K), for each assertion in K at most one rule in Fig. 3.1 is applied, thus proving the PTIME complexity.
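As a small sketch of how a single row of Fig. 3.1 acts on an assertion, assuming an encoding of inclusion assertions as pairs of strings and an ad-hoc naming scheme for the auxiliary role, the case Y = ∃R.C applied to assertion (3.4) of Example 3.1.1 looks as follows.

    def convert_qualified_existential(x, r, c):
        """Replace X ⊑ ∃R.C by three basic assertions using a fresh auxiliary role."""
        r_aux = r + "_aux"
        return [
            (x, f"EXISTS {r_aux}"),          # X ⊑ ∃Raux
            (f"EXISTS {r_aux}-", c),         # ∃Raux− ⊑ C
            (r_aux, r),                      # Raux ⊑ R
        ]

    print(convert_qualified_existential("employee", "WORKS-FOR", "project"))
    # [('employee', 'EXISTS WORKS-FOR_aux'),
    #  ('EXISTS WORKS-FOR_aux-', 'project'),
    #  ('WORKS-FOR_aux', 'WORKS-FOR')]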
Even though basic DL-LiteFRS KBs have a form that recalls that of DL-LiteF and DL-LiteR, they allow the features of both these logics to be combined without restrictions.
From the results of [30], it follows that query answering over basic DL-LiteFRS is no longer in LOGSPACE w.r.t. data complexity, and hence DL-LiteFRS loses the most interesting computational feature for ontology-based DIS query answering. Thus, we next define a new DL called DL-LiteA, starting from DL-LiteFRS and requiring, on the one hand, that KBs be expressed in the basic form and, on the other hand, that the use of functionality be restricted.

Definition 3.3.5 A DL-LiteA knowledge base K = ⟨T, A⟩ is a basic DL-LiteFRS KB such that T satisfies the following conditions:

1. for every role inclusion assertion Q ⊑ R in T, where R is an atomic role or the inverse of an atomic role, the assertions (funct R) and (funct R−) are not in T;
2. for every concept attribute inclusion assertion UC ⊑ VC in T, where VC is an atomic concept attribute, the assertion (funct VC) is not in T;
3. for every role attribute inclusion assertion UR ⊑ VR in T, where VR is an atomic role attribute, the assertion (funct VR) is not in T.

Roughly speaking, a DL-LiteA knowledge base imposes on the global schema the condition that no functional role can be specialized by using it in the right-hand side of a role inclusion assertion. The same condition is also imposed on every functional (role or concept) attribute. As we will show later, this limitation is sufficient to guarantee that query answering can be reduced to first-order query evaluation over a database.

Example 3.3.6 Clearly, the KB K = ⟨T, A⟩ such that T is the TBox of Ex. 3.1.1 and A is the ABox of Ex. 3.1.4 satisfies the conditions above. Thus, it is an example of a DL-LiteA KB.
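As a sketch of the syntactic check suggested by Definition 3.3.5, assuming role and attribute inclusions encoded as pairs of names (with an inverse role P− written "P-") and functionality assertions encoded as a set of names, one can verify the conditions as follows; applied to the TBox of Example 3.1.1 the check succeeds, in accordance with Example 3.3.6.

    def is_dl_lite_a(role_inclusions, attr_inclusions, functional):
        # condition 1: the right-hand side R of a role inclusion must not be
        # functional, and neither may its inverse
        for _, rhs in role_inclusions:
            base = rhs[:-1] if rhs.endswith("-") else rhs
            if base in functional or base + "-" in functional:
                return False
        # conditions 2 and 3: the right-hand side of a (concept or role) attribute
        # inclusion must not be functional
        for _, rhs in attr_inclusions:
            if rhs in functional:
                return False
        return True

    # MANAGES is functional but occurs only on the left-hand side of an inclusion,
    # and the functional attributes persName, projName, until are never specialized.
    print(is_dl_lite_a(
        role_inclusions=[("MANAGES", "WORKS-FOR"), ("delta(until)", "WORKS-FOR")],
        attr_inclusions=[],
        functional={"MANAGES", "persName", "projName", "until"},
    ))  # True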
Chapter 4
DL-LiteA reasoning

In this chapter we study the main DL-LiteA reasoning services, i.e. KB satisfiability and query answering. Thus, after introducing the representation of a DL-LiteA KB in a relational DBMS, we present preliminary results that lead us to algorithms for (i) checking DL-LiteA KB satisfiability, and (ii) solving query answering, both relying on the use of an SQL engine. At the same time, we prove the correctness of these algorithms and study their complexity. All these results provide the foundations for the investigation of ontology-based DIS in the next chapter.
4.1 Storage of a DL-LiteA ABox

Let K = ⟨T, A⟩ be a DL-LiteA KB. As already discussed, we will show that DL-LiteA keeps the nice property of the DLs in the DL-Lite family of allowing query processing to be delegated, after a preprocessing phase which is independent of the data, to an underlying DBMS managing the data layer, i.e. the ABox. Thus, throughout this chapter, we assume that a DL-LiteA KB K = ⟨T, A⟩ is represented by a database DB as specified below.

Definition 4.1.1 Given a TBox T and a database DB with domain Γ ∪ V, we say that DB represents a KB K = ⟨T, A⟩ in the context of T if DB is as follows:

❒ for each atomic concept A there is a unary relation A, and for each tuple (co) in A there exists one membership assertion A(co) in A;
❒ for each atomic value-domain D there is a unary relation D, and for each tuple (cv) in D there exists one membership assertion D(cv) in A;
❒ for each atomic role P there is a binary relation P, and for each tuple (a1, a2) in P there exists a membership assertion P(a1, a2) or P−(a2, a1) in A;
❒ for each atomic concept attribute UC there is a binary relation UC, and for each tuple (b, d) in UC there exists a membership assertion UC(b, d) in A;
❒ for each atomic role attribute UR there is a ternary relation UR, and for each tuple (a1, a2, d) in UR there exists a membership assertion UR(a1, a2, d) in A;
❒ for each tuple (sv ) in Fresh, there exists one soft constant sv ∈ V. Intuitively, this will let us deal with soft constants as if they were (hard) constants, without forgetting that they are not. As usual, given any first-order logic query Q, we denote as ans(Q, DB) the set of answers that are returned by the evaluation of Q over DB.
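As a sketch of Definition 4.1.1, assuming sqlite3 as a stand-in for the DBMS managing the data layer, the ABox of Example 3.1.4 can be stored as follows; the table names mirror the atomic concepts and attributes of the example, and the relation Fresh records the soft constants.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    -- one unary relation per atomic concept
    CREATE TABLE tempEmp(c TEXT);
    CREATE TABLE manager(c TEXT);
    -- one binary relation per atomic concept attribute
    CREATE TABLE projName(o TEXT, v TEXT);
    -- one ternary relation per atomic role attribute
    CREATE TABLE until(o1 TEXT, o2 TEXT, v TEXT);
    -- soft constants are recorded in Fresh
    CREATE TABLE Fresh(c TEXT);

    INSERT INTO tempEmp  VALUES ('z');                          -- (3.18)
    INSERT INTO until    VALUES ('z', 'DIS-1212', '25-09-05');  -- (3.19)
    INSERT INTO projName VALUES ('DIS-1212', 'QuOnto');         -- (3.20)
    INSERT INTO manager  VALUES ('Lenz');                       -- (3.21)
    INSERT INTO Fresh    VALUES ('z');
    """)

    # ans(Q, DB) is then ordinary SQL evaluation over DB, for instance:
    print(db.execute("SELECT v FROM projName").fetchall())      # [('QuOnto',)]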
4.2 Preliminaries

In this section, we present three main constructions that will be crucial for the investigation of DL-LiteA reasoning, namely the minimal model for A, the canonical model for K, and the closure of negative inclusions.
4.2.1  Minimal model for a DL-LiteA ABox
Given a DL-LiteA ABox A, we denote by db(A) the (Herbrand) interpretation of A. More precisely, db(A) is the interpretation (∆^db(A), ·^db(A)) such that ∆^db(A) is the disjoint union of the two domains ∆_O^db(A) = ΓO and ∆_V^db(A) = ΓV, and ·^db(A) is as follows:

❒ a^db(A) = a, for each constant a ∈ Γ, where Γ = ΓO ∪ ΓV;
❒ A^db(A) = {a | A(a) ∈ A}, for each atomic concept A;
❒ D^db(A) = {d | D(d) ∈ A}, for each atomic value-domain D;
❒ P^db(A) = {(a1, a2) | P(a1, a2) ∈ A}, for each atomic role P;
❒ UC^db(A) = {(a1, d) | UC(a1, d) ∈ A}, for each atomic concept attribute UC;
❒ UR^db(A) = {(a1, a2, d) | UR(a1, a2, d) ∈ A}, for each atomic role attribute UR.

It is easy to see that db(A) is a minimal Herbrand model of A with a most general assignment µ0 for A w.r.t. db(A).
4.2.2  Canonical interpretation
The canonical interpretation of a DL-LiteA KB is an interpretation constructed according to the notion of chase [5]. In particular, we adapt here the notion of restricted chase adopted by Johnson and Klug in [59]. To this aim, we also rely on the notion of most general assignment introduced in Definition 3.1.6.

Definition 4.2.1 Let K = ⟨T, A⟩ be a DL-LiteA KB. We call canonical interpretation of K the minimal interpretation can(K) = (∆^can(K), ·^can(K)) of K that satisfies the following conditions, where ∆^can(K) is the disjoint union of the sets ∆_O^can(K) and ∆_V^can(K), and we use a and v, possibly with subscripts or superscripts, to indicate, respectively, an object in ∆_O^can(K) and a value in ∆_V^can(K).
(cr0) Initially:

    ∆_O^can(K) ⊇ {a | a ∈ ΓO occurs in A} ∪ {so | so ∈ VO occurs in A}
    ∆_V^can(K) ⊇ {d | d ∈ ΓV occurs in A} ∪ {sv | sv ∈ VV occurs in A}
    a^can(K) = a, for each object constant a
    d^can(K) = d, for each value constant d
    A^can(K) = {co | A(co) ∈ A}, for each atomic concept A
    D^can(K) = {cv | D(cv) ∈ A}, for each atomic value-domain D
    UC^can(K) = {(a, d) | UC(a, d) ∈ A}, for each atomic concept attribute UC
    UR^can(K) = {(a1, a2, d) | UR(a1, a2, d) ∈ A}, for each atomic role attribute UR
    P^can(K) = {(a1, a2) | P(a1, a2) ∈ A or P−(a2, a1) ∈ A}, for each atomic role P

(cr1) If a ∈ A1^can(K), A1 ⊑ X2 is in Tp, and a ∉ X2^can(K), then:
    1. if X2 = A2, then add a to A2^can(K);
    2. if X2 = ∃Q2, where Q2 = P2 | P2−, then add (a, an) to Q2^can(K), where an is a new element of ∆_O^can(K);
    3. if X2 = ∃Q2, where Q2 = δ(UR2) | δ(UR2)−, then add (a, an, dn) to UR2^can(K), where an is a new element of ∆_O^can(K) and dn is a new element of ∆_V^can(K);
    4. if X2 = δ(UC), then add (a, dn) to UC^can(K), where dn is a new element of ∆_V^can(K).

(cr2) If (a, a′) ∈ Q1^can(K), ∃Q1 ⊑ X2 is in Tp, where Q1 = P1 | P1−, and a ∉ X2^can(K), then:
    1. if X2 = A2, then add a to A2^can(K);
    2. if X2 = ∃Q2, where Q2 = P2 | P2−, then add (a, an) to Q2^can(K), where an is a new element of ∆_O^can(K);
    3. if X2 = ∃Q2, where Q2 = δ(UR2) | δ(UR2)−, then add (a, an, dn) to UR2^can(K), where an is a new element of ∆_O^can(K) and dn is a new element of ∆_V^can(K);
    4. if X2 = δ(UC), then add (a, dn) to UC^can(K), where dn is a new element of ∆_V^can(K).

(cr3) If (a, d′) ∈ UC1^can(K), δ(UC1) ⊑ X2 is in Tp, and a ∉ X2^can(K), then:
    1. if X2 = A2, then add a to A2^can(K);
    2. if X2 = ∃Q2, where Q2 = P2 | P2−, then add (a, an) to Q2^can(K), where an is a new element of ∆_O^can(K);
    3. if X2 = ∃Q2, where Q2 = δ(UR2) | δ(UR2)−, then add (a, an, dn) to UR2^can(K), where an is a new element of ∆_O^can(K) and dn is a new element of ∆_V^can(K);
    4. if X2 = δ(UC2), then add (a, dn) to UC2^can(K), where dn is a new element of ∆_V^can(K).

(cr4) If (a, a′, d′) ∈ Q1^can(K), ∃Q1 ⊑ X2 is in Tp, where Q1 = δ(UR1) | δ(UR1)−, and a ∉ X2^can(K), then:
    1. if X2 = A2, then add a to A2^can(K);
    2. if X2 = δ(UC), then add (a, dn) to UC^can(K), where dn is a new element of ∆_V^can(K);
    3. if X2 = ∃Q2, where Q2 = P2 | P2−, then add (a, an) to Q2^can(K), where an is a new element of ∆_O^can(K);
    4. if X2 = ∃Q2, where Q2 = δ(UR2) | δ(UR2)−, then add (a, an, dn) to UR2^can(K), where an is a new element of ∆_O^can(K) and dn is a new element of ∆_V^can(K).

(cr5) If (a1, a2) ∈ Q1^can(K), Q1 ⊑ X2 is in Tp, where Q1 = P1 | P1−, and (a1, a2) ∉ X2^can(K), then:
    1. if X2 = P2 | P2−, then add (a1, a2) to X2^can(K);
    2. if X2 = δ(UR2) | δ(UR2)−, then add (a1, a2, dn) to X2^can(K), where dn is a new element of ∆_V^can(K).

(cr6) If (a1, a2, d′) ∈ Q1^can(K), Q1 ⊑ X2 is in Tp, where Q1 = δ(UR1) | δ(UR1)−, and (a1, a2) ∉ X2^can(K), then:
    1. if X2 = P2 | P2−, then add (a1, a2) to X2^can(K);
    2. if X2 = δ(UR2) | δ(UR2)−, then add (a1, a2, dn) to X2^can(K), where dn is a new element of ∆_V^can(K).

(cr7) If d ∈ D1^can(K), D1 ⊑ X2 is in Tp, and d ∉ X2^can(K), then:
    1. if X2 = D2, then add d to D2^can(K);
    2. if X2 = ρ(UC2), then add (an, d) to UC2^can(K), where an is a new element of ∆_O^can(K);
    3. if X2 = ρ(UR2), then add (an, a′n, d) to UR2^can(K), where an, a′n are new elements of ∆_O^can(K).

(cr8) If (a′, d) ∈ UC1^can(K), ρ(UC1) ⊑ X2 is in Tp, and d ∉ X2^can(K), then:
    1. if X2 = D2, then add d to D2^can(K);
    2. if X2 = ρ(UC2), then add (an, d) to UC2^can(K), where an is a new element of ∆_O^can(K);
    3. if X2 = ρ(UR2), then add (an, a′n, d) to UR2^can(K), where an, a′n are new elements of ∆_O^can(K).

(cr9) If (a′, a′′, d) ∈ UR1^can(K), ρ(UR1) ⊑ X2 is in Tp, and d ∉ X2^can(K), then:
    1. if X2 = D2, then add d to D2^can(K);
    2. if X2 = ρ(UC2), then add (an, d) to UC2^can(K), where an is a new element of ∆_O^can(K);
    3. if X2 = ρ(UR2), then add (an, a′n, d) to UR2^can(K), where an, a′n are new elements of ∆_O^can(K).

(cr10) If (a, d) ∈ UC1^can(K), UC1 ⊑ UC2 is in Tp, and (a, d) ∉ UC2^can(K), then add (a, d) to UC2^can(K).

(cr11) If (a1, a2, d) ∈ UR1^can(K), UR1 ⊑ UR2 is in Tp, and (a1, a2, d) ∉ UR2^can(K), then add (a1, a2, d) to UR2^can(K).
Rules in the previous definition are called chase rules. Although they are numerous and may look complicated, intuitively they "simply" aim at constructing a Herbrand interpretation of K satisfying the ABox and the set of PI assertions Tp. In particular, we have the following notable property of can(K).

Proposition 4.2.2 Let K = ⟨T, A⟩ be a satisfiable DL-LiteA KB, and µ an assignment for A. Then, for each model I = (∆^I, ·^I) of K with µ, there exists a homomorphism Ψ from ∆^can(K) to ∆^I, i.e. a function Ψ such that:

❒ for every j-tuple ~t ∈ (∆^can(K))^j, with j ∈ {1, 2, 3},

    ~t ∈ X^can(K) ⇒ Ψ(~t) ∈ X^I    (4.1)

where Ψ is applied to ~t componentwise and X may denote either a concept (in which case j = 1), or a value-domain (j = 1), or a role (j = 2), or a concept attribute (j = 2), or a role attribute (j = 3) in K.
Proof. Let I = (∆I , ·I ) be a model of K with µ. We next show how to build a function Ψ from ∆can(K) to ∆I , by proceeding by induction on the construction of can(K). Simultaneously we show that Ψ is a homomorphism, i.e. Ψ satisfies 4.1. ❒ Base step: For each membership assertion α: – If α = X(sv ), where X denotes either an atomic concept or an atomic value-domain, sv ∈ V, then, by construction of can(K), we have that sv ∈ ∆can(K) , and sv ∈ X can(K) . We then set Ψ(sv ) = µ(sv ). Thus, since I is a model of α, we have that µ(sv ) ∈ X I , and 4.1 is satisfied; – If α = X(~t), where X denotes any atomic expression and ~t = (t1 , · · · , tj ) ∈ Γj , for j = 1, 2, 3, then, by construction of can(K) we have that ti ∈ ∆can(K) for each i = 1, · · · , j, and ~t ∈ X can(K) . Then, we set Ψ(~t) = ~tI . Thus, since I is a model of α, we have that ~tI ∈ X I , and 4.1 is satisfied;
❒ Inductive step: Let can_i(K) be the portion of can(K) obtained after i applications of the chase rules. By the inductive hypothesis, can_i(K) satisfies (4.1), i.e. for each tuple ~t over ∆^can(K), ~t ∈ X^can_i(K) ⇒ Ψ(~t) ∈ X^I. Suppose now that can_{i+1}(K) is the portion of can(K) obtained from can_i(K) by the application of one of the chase rules, say, for instance, rule (cr1). Thus suppose that a ∈ B^can_i(K) and B ⊑ X ∈ Tp. By the inductive hypothesis we have that Ψ(a) ∈ B^I, where Ψ(a) ∈ ∆^I. Now, depending on the form of X, the application of (cr1) may lead to one of the following cases:

  – if X = A, then we have that a ∈ A^can_{i+1}(K); moreover, since I is a model of K, I satisfies B ⊑ A and thus Ψ(a) ∈ A^I;
  – if X = ∃Q, where Q = δ(UR) | δ(UR)−, then we have that (a, an, dn) ∈ UR^can_{i+1}(K), where an, dn are new elements of ∆_O^can(K) and ∆_V^can(K) respectively; therefore, Ψ(an) and Ψ(dn) were not yet defined; moreover, since I is a model of Tp, there must exist two elements o ∈ ∆_O^I and w ∈ ∆_V^I such that (Ψ(a), o, w) ∈ UR^I; then, by setting Ψ(an) = o and Ψ(dn) = w, we obtain (Ψ(a), Ψ(an), Ψ(dn)) ∈ UR^I;
  – if X = ∃Q, where Q = P | P−, then we have that (a, an) ∈ Q^can_{i+1}(K), where an is a new element of ∆_O^can(K); therefore, Ψ(an) was not yet defined; moreover, since I is a model of Tp, there must exist an element o ∈ ∆_O^I such that (Ψ(a), o) ∈ Q^I; then, by setting Ψ(an) = o, we obtain (Ψ(a), Ψ(an)) ∈ Q^I;
  – if X = δ(UC), then we have that (a, dn) ∈ UC^can_{i+1}(K), where dn is a new element of ∆_V^can(K); therefore, Ψ(dn) was not yet defined; moreover, since I is a model of Tp, there must exist an element v ∈ ∆_V^I such that (Ψ(a), v) ∈ UC^I; then, by setting Ψ(dn) = v, we obtain (Ψ(a), Ψ(dn)) ∈ UC^I.

Thus, we proved that if can_{i+1}(K) is obtained by the application of rule (cr1), then it still satisfies (4.1). Proceeding analogously with the other chase rules, we can easily prove the claim.
The above proposition is very important because it proves that, if K is satisfiable, then can(K) can be seen as a "representative" of all the models of K with µ. As we will see, we will use this property of can(K) several times throughout our proofs.
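As a toy sketch of how the chase rules fire, assuming the fragment of Example 3.1.1 and the individual Lenz of Example 3.1.4, the inclusions manager ⊑ employee, employee ⊑ person and person ⊑ δ(persName) successively force, in the spirit of rule (cr1), Lenz into employee and person and then force a fresh value for his name; the variable names and the encoding below are assumptions made only for this illustration.

    from itertools import count

    fresh_val = (f"v{i}" for i in count())

    concepts = {"manager": {"Lenz"}, "employee": set(), "person": set()}
    persName = set()                                   # pairs (object, value)

    # concept-to-concept PIs, applied as in (cr1), case 1
    for lhs, rhs in [("manager", "employee"), ("employee", "person")]:
        for a in sorted(concepts[lhs]):
            if a not in concepts[rhs]:
                concepts[rhs].add(a)

    # person ⊑ δ(persName), applied as in (cr1), case 4: introduce a fresh value
    for a in sorted(concepts["person"]):
        if not any(o == a for (o, _) in persName):
            persName.add((a, next(fresh_val)))

    print(concepts["person"], persName)                # {'Lenz'} {('Lenz', 'v0')}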
4.2.3  Closure of negative inclusions
Following the same approach as [31], we next introduce the notion of NI-closure, which results from adapting the corresponding notion of [31] to our logic.

Definition 4.2.3 Let T be a DL-LiteA TBox. We call NI-closure of T, denoted by cln(T), the TBox obtained inductively as follows:

1. all negative inclusion assertions in T are also in cln(T);
2. if B1 ⊑ B2 is in T and B2 ⊑ ¬B3 or B3 ⊑ ¬B2 is in cln(T), then also B1 ⊑ ¬B3 is in cln(T);
3. if E1 ⊑ E2 is in T and E2 ⊑ ¬E3 or E3 ⊑ ¬E2 is in cln(T), then also E1 ⊑ ¬E3 is in cln(T);
4. if Q1 ⊑ Q2 is in T and ∃Q2 ⊑ ¬B or B ⊑ ¬∃Q2 is in cln(T), then also ∃Q1 ⊑ ¬B is in cln(T);
5. if Q1 ⊑ Q2 is in T and ∃Q2− ⊑ ¬B or B ⊑ ¬∃Q2− is in cln(T), then also ∃Q1− ⊑ ¬B is in cln(T);
6. if Q1 ⊑ Q2 is in T and Q2 ⊑ ¬Q3 or Q3 ⊑ ¬Q2 is in cln(T), then also Q1 ⊑ ¬Q3 is in cln(T);
7. if one of the assertions ∃Q ⊑ ¬∃Q, ∃Q− ⊑ ¬∃Q−, or Q ⊑ ¬Q is in cln(T), then all three such assertions are in cln(T);
8. if UC1 ⊑ UC2 is in T and δ(UC2) ⊑ ¬B or B ⊑ ¬δ(UC2) is in cln(T), then also δ(UC1) ⊑ ¬B is in cln(T);
9. if UC1 ⊑ UC2 is in T and ρ(UC2) ⊑ ¬E or E ⊑ ¬ρ(UC2) is in cln(T), then also ρ(UC1) ⊑ ¬E is in cln(T);
10. if UC1 ⊑ UC2 is in T and UC2 ⊑ ¬UC3 or UC3 ⊑ ¬UC2 is in cln(T), then also UC1 ⊑ ¬UC3 is in cln(T);
11. if one of the assertions ρ(UC) ⊑ ¬ρ(UC), δ(UC) ⊑ ¬δ(UC), or UC ⊑ ¬UC is in cln(T), then all three such assertions are in cln(T);
12. if UR1 ⊑ UR2 is in T and ρ(UR2) ⊑ ¬E or E ⊑ ¬ρ(UR2) is in cln(T), then also ρ(UR1) ⊑ ¬E is in cln(T);
13. if UR1 ⊑ UR2 is in T and δ(UR2) ⊑ ¬P or P ⊑ ¬δ(UR2) is in cln(T), then also δ(UR1) ⊑ ¬P is in cln(T);
14. if UR1 ⊑ UR2 is in T and δ(UR2)− ⊑ ¬P or P ⊑ ¬δ(UR2)− is in cln(T), then also δ(UR1)− ⊑ ¬P is in cln(T);
15. if UR1 ⊑ UR2 is in T and UR2 ⊑ ¬UR3 or UR3 ⊑ ¬UR2 is in cln(T), then also UR1 ⊑ ¬UR3 is in cln(T);
16. if one of the assertions ρ(UR) ⊑ ¬ρ(UR), δ(UR) ⊑ ¬δ(UR), or UR ⊑ ¬UR is in cln(T), then all three such assertions are in cln(T).
6. if Q1 ⊑ Q2 is in T and Q2 ⊑ ¬Q3 or Q3 ⊑ ¬Q2 is in Tn , then also Q1 ⊑ ¬Q3 is in cln(T ); 7. if one of the assertions ∃Q ⊑ ¬∃Q, ∃Q− ⊑ ¬∃Q− , or Q ⊑ ¬Q is in cln(T ), then all three such assertions are in cln(T ); 8. if UC1 ⊑ UC2 is in T and δ(UC2 ) ⊑ ¬B or B ⊑ ¬δ(UC2 ) is in cln(T ), then also δ(UC1 ) ⊑ ¬B is in cln(T ); 9. if UC1 ⊑ UC2 is in T and ρ(UC2 ) ⊑ ¬E or E ⊑ ¬ρ(UC2 ) is in cln(T ), then also ρ(UC1 ) ⊑ ¬E is in cln(T ); 10. if UC1 ⊑ UC2 is in T and UC2 ⊑ ¬UC3 or UC3 ⊑ ¬UC2 is in cln(T ), then also UC1 ⊑ ¬UC3 is in cln(T ); 11. if one of the assertions ρ(UC ) ⊑ ¬ρ(UC ), δ(UC ) ⊑ ¬δ(UC ), or UC ⊑ ¬UC is in cln(T ), then all three such assertions are in cln(T ); 12. if UR1 ⊑ UR2 is in T and ρ(UR2 ) ⊑ ¬E or E ⊑ ¬ρ(UR2 ) is in cln(T ), then also ρ(UR1 ) ⊑ ¬E is in cln(T ); 13. if UR1 ⊑ UR2 is in T and δ(UR2 ) ⊑ ¬P or P ⊑ ¬δ(UR2 ) is in cln(T ), then also δ(UR1 ) ⊑ ¬P is in cln(T ); 14. if UR1 ⊑ UR2 is in T and δ(UR−2 ) ⊑ ¬P or P ⊑ ¬δ(UR−2 ) is in cln(T ), then also δ(UR−1 ) ⊑ ¬P is in cln(T ); 15. if UR1 ⊑ UR2 is in T and UR2 ⊑ ¬UR3 or UR3 ⊑ ¬UR2 is in cln(T ), then also UR1 ⊑ ¬UR3 is in cln(T ); 16. if one of the assertions ρ(UR ) ⊑ ¬ρ(UR ), δ(UR ) ⊑ ¬δ(UR ), or UR ⊑ ¬UR is in cln(T ), then all three such assertions are in cln(T ). Example 4.2.4 Consider the TBox of Example 3.1.1. Clearly, the NI-closure of T is the following set of NI: manager ⊑ ¬∃δ(until)
(4.2)
manager ⊑ ¬tempEmp
(4.3)
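As a sketch of the fixpoint computation behind Definition 4.2.3, restricted to closure rules 1 and 2 and assuming an encoding of assertions as pairs of strings (a pair (X, Y) in the negative set is read as X ⊑ ¬Y), the disjointness observed in Example 4.2.4 can be derived as follows.

    def ni_closure_concepts(positive, negative):
        """Closure rules 1 and 2 only: PIs are pairs (B1, B2), NIs are pairs (B2, B3)."""
        cln = set(negative)                               # rule 1
        changed = True
        while changed:
            changed = False
            for (b1, b2) in positive:                     # rule 2
                for (x, y) in list(cln):
                    if x == b2:                           # B2 ⊑ ¬B3 in cln(T)
                        new = (b1, y)
                    elif y == b2:                         # B3 ⊑ ¬B2 in cln(T)
                        new = (b1, x)
                    else:
                        continue
                    if new not in cln:
                        cln.add(new)
                        changed = True
        return cln

    # Fragment of Example 3.1.1: tempEmp ⊑ ∃δ(until) and manager ⊑ ¬∃δ(until) (3.17).
    pis = {("tempEmp", "EXISTS delta(until)")}
    nis = {("manager", "EXISTS delta(until)")}
    print(ni_closure_concepts(pis, nis))
    # the result contains ('tempEmp', 'manager'), i.e. tempEmp ⊑ ¬manager,
    # matching the disjointness of managers and fixed-term employees of Example 4.2.4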
4.3 Satisfiability of a DL-LiteA KB

In this section, we investigate the satisfiability of a DL-LiteA KB. To this aim, by following an approach that is similar to that of [31], we first show some notable properties of the notions introduced in the previous section, and then we show how to exploit such properties to provide an algorithm for checking DL-LiteA KB satisfiability. Finally, we study its complexity.
4.3.1  Foundations of the algorithm for satisfiability
The algorithm for checking the satisfiability of a DL-LiteA KB strongly relies on the notions introduced in the previous section. Thus, we start by giving results that relate all these notions to DL-LiteA KB satisfiability. Specifically, the lemma below shows that the canonical model of a KB always satisfies the set of positive inclusions; moreover, it shows that can(K) is a model of the ABox with an assignment µ0.

Lemma 4.3.1 Let K = ⟨T, A⟩ be a DL-LiteA KB. Then, we have that:

1. can(K) |= Tp, where Tp denotes the set of positive inclusions in T;
2. there exists a most general assignment µ0 for A w.r.t. can(K) such that can(K) |= A[µ0].
Proof. It is easy to see that item 1 follows directly from the definition of can(K). Indeed, can(K) is built in such a way that every PI in Tp is satisfied (cf. rules (cri), for i = 1, . . . , 11). Let us now consider item 2. By rule (cr0), we have that can(K) |= α for each membership assertion α not involving soft constants. Now, let us construct an assignment µ0 as follows: for each s ∈ V, µ0(s) = s. Clearly, by construction, µ0 is a most general assignment for A w.r.t. can(K). Moreover, can(K) |= α[µ0] for each membership assertion α in A, and hence can(K) |= A[µ0].

In contrast with the previous lemma, the following one shows that the canonical model of a KB satisfies the set of functionality assertions if and only if the minimal model for A is a model of such a set of assertions.

Lemma 4.3.2 Let K = ⟨T, A⟩ be a DL-LiteA KB. Then, can(K) |= Tk ⇐⇒ db(A) |= Tk, where Tk denotes the set of functionality assertions in T.
Proof. “⇒”: By construction, db(A) coincides exactly with the interpretation obtained by applying rule (cr0) only. Thus, if can(K) satisfies Tk, then, clearly, db(A) satisfies Tk.
“⇐”: Suppose that db(A) satisfies Tk. Then, by induction on the construction of can(K), we show that can(K) satisfies Tk.

❒ Base step: By hypothesis, db(A) satisfies Tk.
❒ Inductive step: Let can_i(K) be the portion of can(K) obtained after i applications of the chase rules. Suppose that can_i(K) |= Tk and, by contradiction, suppose that can_{i+1}(K) ⊭ Tk, where can_{i+1}(K) is obtained from can_i(K) by applying one of the rules (crj), with j ∈ {1, · · · , 11}. It is worth noticing first that not all the rules can cause the violation of a functionality assertion. In particular, there are three types of safe rules:

  – first type: rules triggered by an inclusion assertion between concepts or value-domains, whose right-hand side does not involve any role, concept attribute, or role attribute;
  – second type: rules triggered by an inclusion assertion between concepts or value-domains, whose right-hand side involves a role, a concept attribute, or a role attribute that is not involved in any functionality assertion in Tk;
  – third type: rules triggered by an inclusion assertion among roles, concept attributes, or role attributes.

For all these types of rules, assuming that can_{i+1}(K) violates a functionality assertion α would make us conclude that α is not satisfied already in can_i(K), which would lead to a contradiction. Indeed, the application of a rule of the first type does not imply the modification of the interpretation of any role, concept attribute, or role attribute. Concerning the rules of the second type, their application implies the modification of the interpretation of either a role, or a concept attribute, or a role attribute that is not involved in any functionality assertion. Finally, a similar argument holds for rules of the third type, since, by definition of DL-LiteA, the right-hand side of inclusions between roles, concept attributes and role attributes does not involve any expression that is also involved in a functionality assertion in Tk.

Therefore, the only rules that may cause can_{i+1}(K) to violate Tk are the rules triggered by the presence of a concept inclusion assertion whose right-hand side involves a role P or P−, a concept attribute UC, or a role attribute UR such that Tk contains, respectively, the assertion (funct P), (funct P−), (funct UC), or (funct UR). For instance, let us assume that can_{i+1}(K) is obtained from can_i(K) by the application of rule (cr1), where a ∈ A1^can_i(K), A1 ⊑ ∃P is in Tp, and a ∉ ∃P^can_i(K), so that (a, an) is added to P^can_{i+1}(K) with an a fresh object. Moreover, we assume that there exists a functionality assertion α involving P which is not satisfied by can_{i+1}(K). However,

  – in the case in which α = (funct P), for α to be violated there must exist two pairs of objects (x, y), (x, z) ∈ P^can_{i+1}(K) such that y ≠ z; since (a, an) ∈ P^can_{i+1}(K) and a ∉ ∃P^can_i(K), there exists no pair (a, a′) ∈ P^can_{i+1}(K) such that a′ ≠ an; hence, we should conclude that the pairs (x, y), (x, z) we are looking for are such that (x, y), (x, z) ∈ P^can_i(K), which leads to a contradiction;
  – in the case in which α = (funct P−), for α to be violated there must exist two pairs of objects (y, x), (z, x) ∈ P^can_{i+1}(K) such that y ≠ z; since an is a fresh object, there exists no pair (a′, an) ∈ P^can_{i+1}(K) such that a′ ≠ a; hence, we should conclude that the pairs (y, x), (z, x) we are looking for are such that (y, x), (z, x) ∈ P^can_i(K), which leads to a contradiction.

Clearly, with a similar argument we may prove that the claim holds also when the other apparently "not safe" chase rules are applied to can_i(K).
In the same spirit as [31], we continue by characterizing when the canonical model of a KB satisfies the assertions forming the KB. Up to now we have considered the set of PIs and the set of functionality assertions; let us now consider the NIs. To this aim, we need the notion of NI-closure introduced in the previous section.
Lemma 4.3.3 Let K = ⟨T, A⟩ be a DL-LiteA KB. Then, can(K) |= Tni ⇐⇒ db(A) |= cln(Tni), where Tni denotes the set of negative inclusions in T.
Proof. “⇒”: Suppose that can(K) is a model of Tni and suppose by contradiction that db(A) does not satisfy some assertion in cln(Tni). Since can(K) is a model of Tni and cln(Tni) denotes the set of assertions that are logically implied by Tni, we have that can(K) |= cln(Tni). But then we obtain a contradiction, since db(A) coincides with the portion of can(K) obtained by application of rule cr0.
“⇐”: Suppose that db(A) is a model of cln(Tni). We prove that can(K) is a model of Tni by induction on the construction of can(K).
❒ Base step: By hypothesis, db(A) satisfies Tni.
❒ Inductive step: Let can_i(K) be the portion of can(K) obtained after i applications of the chase rules. Suppose that can_i(K) |= Tni and, by contradiction, suppose that can_{i+1}(K) ⊭ Tni, where can_{i+1}(K) is obtained from can_i(K) by applying one of the chase rules. For instance, suppose that can_{i+1}(K) is obtained by application of rule cr1 to can_i(K), i.e., there exists a ∈ ∆^{can(K)} such that a ∈ A1^{can_i(K)}, A1 ⊑ A2 is a PI in Tp, and a ∉ A2^{can_i(K)}. Then, we have that a ∈ A2^{can_{i+1}(K)}. Now, if can_{i+1}(K) is not a model of Tni, then there must exist a NI α in Tni that is not satisfied by can_{i+1}(K). However, by construction, can_i(K) and can_{i+1}(K) differ only in the fact that a ∉ A2^{can_i(K)} and a ∈ A2^{can_{i+1}(K)}. Then, in order for can_{i+1}(K) to violate α, α must involve A2. Thus, for instance, α may have the form Y1 ⊑ ¬A2 with a ∈ Y1^{can_{i+1}(K)}. But then, since Y1 ⊑ ¬A2 and A1 ⊑ A2, also A1 ⊑ ¬Y1 belongs to cln(Tni). Thus, we obtain a contradiction, since a ∈ Y1^{can_i(K)} and a ∈ A1^{can_i(K)}, which contradicts the fact that can_i(K) satisfies cln(Tni). Clearly, with a similar argument we can prove the inductive step also in the cases in which can_{i+1}(K) is obtained from can_i(K) by applying one of the other chase rules.
Next, with the following two propositions, we put everything together and set up the basis of the algorithm for checking satisfiability.
Proposition 4.3.4 Let K = ⟨T, A⟩ be a DL-LiteA KB. Then, ∃µ0 such that can(K) |= K[µ0] ⇐⇒ db(A) |= Tk ∧ cln(T), where µ0 is a most general assignment for A w.r.t. can(K).
Proof. The proof follows directly from Lemmas 4.3.1, 4.3.2, and 4.3.3.
Proposition 4.3.5 Let K = ⟨T, A⟩ be a DL-LiteA KB. Then K is satisfiable ⇐⇒ ∃µ0 such that can(K) |= K[µ0], where µ0 is a most general assignment for A w.r.t. can(K).
Proof. “⇐”: Trivially, if there exists an assignment µ0 such that can(K) is a model of K with µ0, then K is satisfiable.
“⇒”: Suppose that K is satisfiable and, by contradiction, that there exists no most general assignment µ0 for A w.r.t. can(K) such that can(K) is a model of K with µ0. Since K is satisfiable, by Proposition 3.1.8 there exist an interpretation I and a most general assignment µ0 for A w.r.t. I such that I is a model of K with µ0. Then, since can(K) is not a model of K with µ0, we have in particular that can(K) ⊭ K[µ0]. Therefore, by Proposition 4.3.4, either db(A) ⊭ Tk or db(A) ⊭ cln(Tni).
❒ Suppose that db(A) violates, for instance, a role functionality assertion (funct P). By construction of db(A), there exist a1, a2, a3 ∈ Γ, with a2 ≠ a3, such that P(a1, a2), P(a1, a3) ∈ A. But then no model of K satisfies A, thus contradicting the hypothesis that K is satisfiable. Clearly, we would obtain a contradiction also by supposing that db(A) violates another type of functionality assertion.
❒ Suppose now that db(A) violates a NI assertion in cln(Tni). By Lemma 4.3.3, we have that can(K) ⊭ Tni. Suppose then, for instance, that can(K) does not satisfy the NI A ⊑ ¬B. Then, there must exist a ∈ ∆^{can(K)} such that a ∈ A^{can(K)} and a ∈ B^{can(K)}. But then, by Proposition 4.2.2, since K is satisfiable and I is a model of K with µ0, there exists a homomorphism Ψ from ∆^{can(K)} to ∆^I. Thus we have Ψ(a) ∈ A^I and Ψ(a) ∈ B^I, which clearly contradicts the fact that I is a model of T, since I does not satisfy the NI A ⊑ ¬B, which is logically implied by T.
From the previous two propositions we are finally able to state the following crucial theorem, which is at the heart of our algorithm for checking DL-LiteA KB satisfiability:
Input: DL-LiteA TBox T and database DB representing K = ⟨T, A⟩ in the context of T
Output: true or false
(1) for each F = (funct X) ∈ Tk do
        Q ← ViolateFunct(F);
        Q′ ← RewDB(Q);
        if ans(Q′, DB) = true then return false
(2) for each NI = X1 ⊑ ¬X2 ∈ cln(Tni) do
        Q ← ViolateNI(NI);
        Q′ ← RewDB(Q);
        if ans(Q′, DB) = true then return false
return true

Figure 4.1: Algorithm Sat(K)

Theorem 4.3.6 Let K = ⟨T, A⟩ be a DL-LiteA KB. Then K is satisfiable ⇐⇒ db(A) |= Tk ∧ cln(T).
Proof. Trivial, from Proposition 4.3.4 and Proposition 4.3.5.
The above theorem is crucial for our purposes, since it allows for reducing the satisfiability check of a DL-LiteA KB to the problem of checking whether a finite model satisfies a set of assertions.
4.3.2 Satisfiability algorithm
Given all the previous results, we are now ready to define, in Fig. 4.1, the algorithm Sat for checking the satisfiability of a DL-LiteA KB. Informally, the algorithm takes as input a DL-LiteA TBox T and a database DB representing a DL-LiteA KB K = ⟨T, A⟩ in the context of T, as discussed in Section 4.1, and returns true or false by proceeding as follows. For each functionality assertion F in Tk (resp. each NI assertion in cln(T)), the algorithm starts by constructing a first-order logic query Q that checks whether the functionality assertion (resp. the NI assertion) is violated in the minimal model db(A) of the ABox. To this aim, it calls the function ViolateFunct(F), shown in Fig. 4.2, which takes as input any functionality assertion of the form F = (funct X) and returns the boolean first-order logic query Q asking whether there exist two pairs of constants that are both interpreted in db(A) as belonging to X and that violate F. Similarly, for each NI X in cln(T), the algorithm calls the function ViolateNI(X), shown in Fig. 4.3, which takes as input any negative inclusion assertion of the form NI = X1 ⊑ ¬X2 in the NI-closure of T and returns the boolean first-order logic query Q asking whether there exists a tuple of constants that is interpreted in db(A) as belonging simultaneously to both X1 and X2. Afterwards, by means of the function RewDB, Q is rewritten in terms of the database DB. More precisely, given a first-order logic query Q, RewDB builds a query over DB that is obtained from Q by replacing each occurrence of an atomic expression X (either a concept, a value-domain, a role, a concept attribute, or a role attribute) with the corresponding relation X in DB (note that, by hypothesis, for each DL-LiteA expression there exists a corresponding relation in DB). Finally, each rewritten query Q′ is evaluated over DB, and the algorithm returns false if at least one such evaluation returns true, and true otherwise.

Input: DL-LiteA functionality assertion F = (funct X)
Output: boolean query
Case of X:
    X = P:    return q() :− P(w, x) ∧ P(w, y) ∧ x ≠ y;
    X = P−:   return q() :− P(x, w) ∧ P(y, w) ∧ x ≠ y;
    X = UC:   return q() :− UC(w, x) ∧ UC(w, y) ∧ x ≠ y;
    X = UR:   return q() :− UR(w1, w2, x) ∧ UR(w1, w2, y) ∧ x ≠ y;

Figure 4.2: Function ViolateFunct

We have the following lemma:
Lemma 4.3.7 Let K = ⟨T, A⟩ be a DL-LiteA KB. Then K is unsatisfiable if and only if Q^{db(A)} = true for some Q such that Q = ViolateFunct(X) for some functionality assertion X ∈ Tk, or Q = ViolateNI(X) for some NI assertion X ∈ cln(T).
Proof. The proof follows directly from Theorem 4.3.6.
Given the previous lemma, by construction of the algorithm Sat, we can immediately claim the correctness of Sat(K):
Theorem 4.3.8 Let K be a DL-LiteA KB. Then K is satisfiable if and only if Sat(K) = true.
From the results in the previous section we can also establish the computational complexity of the satisfiability problem for a DL-LiteA KB. The proof is omitted since it can be straightforwardly adapted from [31].
Theorem 4.3.9 Given a DL-LiteA KB K, Sat(K) runs in LOGSPACE in the size of the database used to represent K (data complexity) and in PTIME in the size of the whole knowledge base (combined complexity).
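To make the structure of the check concrete, the following Python sketch illustrates the scheme just described, under simplifying assumptions that are not part of the thesis: the database is a dictionary mapping relation names to sets of tuples (i.e., the effect of RewDB is taken for granted), a functionality assertion is given as a relation name together with its key and value positions, and a negative inclusion is given as a pair of atoms to be checked for a common projection. The helper names (violates_funct, violates_ni, sat) are illustrative only.

from collections import defaultdict

# A rough rendering of the check performed by Algorithm Sat (Fig. 4.1) once
# RewDB has replaced each DL-LiteA expression by the corresponding relation.

def violates_funct(db, relation, key_positions, value_position):
    # True iff some key is related to two distinct values, i.e. the boolean
    # query produced by ViolateFunct (Fig. 4.2) evaluates to true over db.
    images = defaultdict(set)
    for tup in db.get(relation, set()):
        key = tuple(tup[i] for i in key_positions)
        images[key].add(tup[value_position])
    return any(len(values) > 1 for values in images.values())

def violates_ni(db, atom1, atom2):
    # True iff some tuple belongs to both sides of a NI X1 ⊑ ¬X2, i.e. the
    # boolean query produced by ViolateNI (Fig. 4.3) evaluates to true over db.
    (rel1, proj1), (rel2, proj2) = atom1, atom2
    ext1 = {tuple(t[i] for i in proj1) for t in db.get(rel1, set())}
    ext2 = {tuple(t[i] for i in proj2) for t in db.get(rel2, set())}
    return bool(ext1 & ext2)

def sat(db, funct_assertions, negative_inclusions):
    # Skeleton of Algorithm Sat: unsatisfiable as soon as one violation query succeeds.
    for relation, key_positions, value_position in funct_assertions:
        if violates_funct(db, relation, key_positions, value_position):
            return False
    for atom1, atom2 in negative_inclusions:
        if violates_ni(db, atom1, atom2):
            return False
    return True

# (funct P) is violated below, because a1 is related to both a2 and a3.
db = {"P": {("a1", "a2"), ("a1", "a3")}, "A": {("a1",)}, "B": set()}
print(sat(db, [("P", (0,), 1)], [(("A", (0,)), ("B", (0,)))]))   # prints False

The final call mirrors step (1) of Fig. 4.1: as soon as one violation query evaluates to true, the KB is reported unsatisfiable.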
Input: DL-LiteA NI assertion NI = X1 ⊑ ¬X2
Output: boolean query
Case of NI:
  NI concept inclusion:
      body ← ∅;
      for i = 1, 2 do
          Case of Xi:
              Xi = Ai:          body ← body ∧ Ai(x);
              Xi = ∃Pi:         body ← body ∧ Pi(x, v);
              Xi = ∃Pi−:        body ← body ∧ Pi(v, x);
              Xi = δ(UCi):      body ← body ∧ UCi(x, v);
              Xi = ∃δ(URi):     body ← body ∧ URi(x, v, w);
              Xi = ∃δ(URi)−:    body ← body ∧ URi(v, x, w);
  NI domain-value inclusion:
      body ← ∅;
      for i = 1, 2 do
          Case of Xi:
              Xi = Di:          body ← body ∧ Di(x);
              Xi = ρ(UCi):      body ← body ∧ UCi(v, x);
              Xi = ρ(URi):      body ← body ∧ URi(v, w, x);
  NI role inclusion:
      body ← ∅;
      for i = 1, 2 do
          Case of Xi:
              Xi = Pi:          body ← body ∧ Pi(x, y);
              Xi = Pi−:         body ← body ∧ Pi(y, x);
              Xi = δ(URi):      body ← body ∧ URi(x, y, v);
              Xi = δ(URi)−:     body ← body ∧ URi(y, x, v);
  NI concept attribute inclusion (i.e. X1 = UC1 and X2 = UC2):
      body ← UC1(x, y) ∧ UC2(x, y);
  NI role attribute inclusion (i.e. X1 = UR1 and X2 = UR2):
      body ← UR1(x, y, z) ∧ UR2(x, y, z);
return q() :− body;

Figure 4.3: Function ViolateNI
4.4 Query answering over DL-LiteA KB In what follows, as we did for satisfiability, we start by presenting preliminary results that are at the heart of the algorithm for query answering over a DL-LiteA KB. Then, after a discussion about the relation between query answering in DL-LiteA and query answering in other DLs of the DL-Lite family, we present our query answering algorithm and discuss its correctness and complexity.
4.4.1 Foundations of query answering algorithm
Similarly to satisfiability, the algorithm for solving query answering over a DL-LiteA KB relies on the existence of the canonical model and on its properties. Thus, we start by giving a crucial result that relates the canonical model to query answering over a DL-LiteA KB. Specifically, the lemma below shows that, given a union of conjunctive queries Q over K, if we were able to query the canonical model of the KB, then we would obtain all the answers to Q.
Lemma 4.4.1 Let K be a satisfiable DL-LiteA KB, and let Q be a union of conjunctive queries over K of arity n. Moreover, let m be the number of distinct soft constants sj occurring in A. Then,
ans(Q, K) = { ~t = (t1, . . . , tn) | ~t^{can(K)} ∈ Q^{can(K)} and there exists µ0 such that, for all i ∈ {1, . . . , n} and all j ∈ {1, . . . , m}, t_i^{can(K)} ≠ µ0(sj) },
where µ0 denotes a most general assignment for A w.r.t. can(K).
Proof. Let ~t be a tuple of constants in Γ. First, suppose that ~t ∈ ans(Q, K). Since K is satisfiable, by Proposition 3.2.2 we have that for each model I of K there exists a most general assignment µ0 for A such that (i) t_i^I ≠ µ0(sj) for each i ∈ {1, . . . , n} and each j ∈ {1, . . . , m}, and (ii) for all models I of K with µ0, we have that ~t^I ∈ Q^I. Moreover, by Proposition 4.3.5, since K is satisfiable we have that can(K) is a model of K with some most general assignment µ0′ for A w.r.t. can(K). Thus, by Proposition 3.1.7, we have that can(K) |= K[µ0′] and, since ~t ∈ ans(Q, K), ~t^{can(K)} ∈ Q^{can(K)}.
Conversely, suppose that ~t^{can(K)} ∈ Q^{can(K)} for some most general assignment µ0. Let Q be the union of conjunctive queries Q = {q1, . . . , qk}, with qi defined as qi(~xi) ← conji(~xi, ~yi) for each i ∈ {1, . . . , k}. Then, there exists i ∈ {1, . . . , k} such that there exists an assignment σ : V → ∆^{can(K)} that maps the variables V occurring in conji(~t, ~yi) to objects of ∆^{can(K)} in such a way that all atoms in conji(~t, ~yi) under the assignment σ evaluate to true in can(K). Now let I be a model of K with µ0. By Proposition 4.2.2, there is a homomorphism Ψ from ∆^{can(K)} to ∆^I. Consequently, the function obtained by composing Ψ and σ maps the variables V occurring in conji(~t, ~yi) to objects of the domain of I in such a way that all atoms in conji(~t, ~yi) under this assignment evaluate to true in I. Therefore, ~t^I ∈ Q^I. Then, by applying Proposition 3.1.8, we obtain that ~t ∈ ans(Q, K).
Next, as in [31], we give a property that relates answering unions of conjunctive queries to answering conjunctive queries (the proof is omitted since it can be straightforwardly adapted from the one given in [31]).
Theorem 4.4.2 Let K be a DL-LiteA KB, and Q a union of conjunctive queries over K. Then,
ans(Q, K) = ⋃_{qi ∈ Q} ans(qi, K).
4.4.2 Query answering algorithm
As already mentioned, the query answering technique for DL-LiteA, as well as for the logics of the DL-Lite family introduced in [29], crucially relies on the existence of the canonical interpretation and on the property that such an interpretation is “representative” of all models, as proved by Proposition 4.2.2. Moreover, from Lemma 4.4.1 it follows that query answering, similarly to satisfiability, can in principle be solved by evaluating the query over the canonical model can(K). However, since can(K) is in general infinite, we obviously avoid the construction of can(K). Rather, in the same spirit as [31], our query answering method consists in first compiling the TBox into a finite reformulation of the query, which is afterwards evaluated over the minimal model db(A) of the ABox. This is achieved by applying the algorithm PerfectRef. As we will see, the only aspect of the whole approach that goes beyond a simple “adaptation” to DL-LiteA of the query answering algorithm proposed in [31] is due to the presence of soft constants in the ABox, whose treatment requires slightly modifying the reformulated query, i.e., the query obtained by means of the PerfectRef algorithm, before evaluating it over the source database. Note that this is consistent with the formulation of Lemma 4.4.1.
According to the discussion above, we next adapt to DL-LiteA the approach proposed in [31] to solve query answering. Thus, we start by presenting the algorithm for query reformulation, which is responsible for reformulating a query by compiling into the query itself the intensional knowledge of the TBox. Then, we present the complete algorithm for query answering.
In order to use the reformulation technique of [31], we first define the notion of applicable inclusion assertion. Intuitively, an inclusion I is applicable to an atom g if the predicate of g is equal to the predicate on the right-hand side of I.
Definition 4.4.3 Let I be a PI assertion. We say that I is applicable to the atom g and, in this case, we indicate with gr(g, I) the atom obtained from g by applying I, if and only if:
❒ I is a concept inclusion assertion of the form I = B1 ⊑ B, and g and B are as follows:
– g = A(x) and B = A,
– or g = P(x, _) and B = ∃P,
– or g = P(_, x) and B = ∃P−,
– or g = UR(x, _, _) and B = ∃δ(UR),
– or g = UR(_, x, _) and B = ∃δ(UR)−,
– or g = UC(x, _) and B = δ(UC).
Then, the form of gr(g, I) depends on B1 as follows:
– if B1 = A1, then gr(g, I) = A1(x);
– if B1 = ∃P1, then gr(g, I) = P1(x, _);
– if B1 = ∃P1−, then gr(g, I) = P1(_, x);
– if B1 = ∃δ(UR1), then gr(g, I) = UR1(x, _, _);
– if B1 = ∃δ(UR1)−, then gr(g, I) = UR1(_, x, _);
– if B1 = δ(UC1), then gr(g, I) = UC1(x, _).
❒ I is a domain-value inclusion assertion of the form I = E1 ⊑ E, and g and E are as follows:
– g = D(x) and E = D,
– or g = UC(_, x) and E = ρ(UC),
– or g = UR(_, _, x) and E = ρ(UR).
Then, the form of gr(g, I) depends on E1 as follows:
– if E1 = D1, then gr(g, I) = D1(x);
– if E1 = ρ(UC1), then gr(g, I) = UC1(_, x);
– if E1 = ρ(UR1), then gr(g, I) = UR1(_, _, x).
❒ I is a role inclusion assertion of the form I = Q1 ⊑ Q, and g and Q are as follows:
– g = P(x1, x2) and Q = P or Q = P−,
– or g = UR(x1, x2, _) and Q = δ(UR) or Q = δ(UR)−.
Then, the form of gr(g, I) depends on Q1 and Q as follows:
– if Q1 = P1 and Q = P, or Q1 = P1− and Q = P−, then gr(g, I) = P1(x1, x2);
– if Q1 = P1− and Q = P, or Q1 = P1 and Q = P−, then gr(g, I) = P1(x2, x1);
– if Q1 = δ(UR1) and Q = δ(UR), or Q1 = δ(UR1)− and Q = δ(UR)−, then gr(g, I) = UR1(x1, x2, _);
– if Q1 = δ(UR1) and Q = δ(UR)−, or Q1 = δ(UR1)− and Q = δ(UR), then gr(g, I) = UR1(x2, x1, _).
❒ I is a concept attribute inclusion assertion of the form I = UC1 ⊑ UC and g = UC(x1, x2). Then, we have that gr(g, I) = UC1(x1, x2).
❒ I is a role attribute inclusion assertion of the form I = UR1 ⊑ UR and g = UR(x1, x2, x3). Then, we have that gr(g, I) = UR1(x1, x2, x3).
We are now ready to define, in Fig. 4.4, the algorithm PerfectRef, which reformulates a conjunctive query taking into account the PIs of a DL-LiteA TBox. In the algorithm, q[g/g′] denotes the conjunctive query obtained from q by replacing the atom g with a new atom g′. Informally, the algorithm first reformulates the atoms of each conjunctive query q ∈ PR′, and produces a new query for each atom reformulation (step (a)). Roughly speaking, PIs are used as rewriting rules, applied from right to left, which allow one to compile into the reformulation the intensional knowledge (represented by T) that is relevant for answering q. At step (b), for each pair of atoms g1, g2 that unify and occur in the body of a query q, the algorithm computes the conjunctive query q′ = reduce(q, g1, g2), by applying to q the most general unifier between g1 and g2. We point out that, in unifying g1 and g2, each occurrence of the symbol _ has to be considered a different unbound variable. The most general unifier substitutes each symbol _ in g1 with the corresponding argument in g2, and vice-versa (obviously, if both arguments are _, the resulting argument is _). Thanks to the unification, variables that are bound in q may become unbound in q′. Hence, PIs that were not applicable to atoms of q may become applicable to atoms of q′ (in the next executions of step (a)). Finally, note that the function τ applied to q′ replaces each occurrence of an unbound variable in q′ with the symbol _.

Input: conjunctive query q, DL-LiteA TBox T
Output: union of conjunctive queries PR over db(A)
PR ← {q};
repeat
    PR′ ← PR;
    for each q ∈ PR′ do
        (a) for each g in q do
                for each PI I in T do
                    if I is applicable to g
                    then PR ← PR ∪ { q[g/gr(g, I)] }
        (b) for each g1, g2 in q do
                if g1 and g2 unify
                then PR ← PR ∪ { τ(reduce(q, g1, g2)) };
until PR′ = PR;
return PR;

Figure 4.4: Algorithm PerfectRef(q, T)
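The following partial Python sketch illustrates step (a) of PerfectRef only; step (b) (reduce and τ) is omitted for brevity, and the encoding of queries and PIs is an assumption of this sketch rather than the thesis's data structures: an atom is a pair (predicate, arguments), "_" stands for an unbound argument, and a PI B1 ⊑ B is given as a pair of atom templates whose slots are either a shared index or None (an unbound position).

def applicable(pi, atom):
    # a PI B1 ⊑ B is applicable to g when g's predicate is the predicate of B
    # and the positions that B leaves unspecified are unbound ("_") in g
    _lhs, (rhs_pred, rhs_slots) = pi
    pred, args = atom
    if pred != rhs_pred or len(args) != len(rhs_slots):
        return False
    return all(arg == "_" for slot, arg in zip(rhs_slots, args) if slot is None)

def gr(atom, pi):
    # rewrite g with the left-hand side B1 of the applicable PI (cf. Def. 4.4.3)
    (lhs_pred, lhs_slots), (_rhs_pred, rhs_slots) = pi
    _pred, args = atom
    binding = {slot: arg for slot, arg in zip(rhs_slots, args) if slot is not None}
    return (lhs_pred, tuple("_" if s is None else binding[s] for s in lhs_slots))

def perfect_ref_step_a(queries, pis):
    # one pass of step (a): rewrite every atom with every applicable PI
    result = set(queries)
    for head, body in queries:
        for i, atom in enumerate(body):
            for pi in pis:
                if applicable(pi, atom):
                    result.add((head, body[:i] + (gr(atom, pi),) + body[i + 1:]))
    return result

def perfect_ref_atoms_only(query, pis):
    # fixpoint of step (a); step (b) (reduce/unification) is not modelled here
    pr = {query}
    while True:
        new_pr = perfect_ref_step_a(pr, pis)
        if new_pr == pr:
            return pr
        pr = new_pr

# Hypothetical PIs: Manager ⊑ Employee and Employee ⊑ ∃WORKS-FOR.
pis = [
    (("Manager", (0,)), ("Employee", (0,))),
    (("Employee", (0,)), ("WORKS-FOR", (0, None))),
]
query = (("x",), (("WORKS-FOR", ("x", "_")),))
for head, body in sorted(perfect_ref_atoms_only(query, pis)):
    print(head, body)
# the reformulation also contains q(x) <- Employee(x) and q(x) <- Manager(x)

With these two hypothetical PIs, the query q(x) ← WORKS-FOR(x, _) is reformulated into a union that also contains q(x) ← Employee(x) and q(x) ← Manager(x), which mirrors part of the reformulation shown in Example 4.4.4 below.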
Example 4.4.4 Consider again the DL-LiteA KB K of Example 3.1.5 and the query q asking for all workers, i.e., those objects that participate in the WORKS-FOR role:
q(x) ← WORKS-FOR(x, y).
The result of PerfectRef(q, T) is the following union of queries Qp:
q(x) ← WORKS-FOR(x, y)
q(x) ← until(x, y, z)
q(x) ← tempEmp(x)
q(x) ← employee(x)
q(x) ← manager(x).
The evaluation of Qp over DB returns the set of certain answers to q over K. Roughly speaking, in order to return all workers, Qp looks in those concepts, roles, and role attributes whose extension in DB, according to the knowledge specified by T, could provide objects that are workers.
Clearly, as in [31], the proposition below holds.
Lemma 4.4.5 Let T be a DL-LiteA TBox, let q be a conjunctive query over T, and let PR be the union of conjunctive queries returned by PerfectRef(q, T). For every DL-LiteA ABox A such that ⟨T, A⟩ is satisfiable, ans(q, ⟨T, A⟩) = PR^{db(A)}.
Proof. The proof is an obvious adaptation of the one proposed in [31].
We are finally able to present the algorithm Answer, shown in Fig. 4.5, for answering a union of conjunctive queries over a KB.

Input: DL-LiteA TBox T and database DB representing K = ⟨T, A⟩, UCQ Q
Output: ans(Q, K)
T ← cln(T);
if K is unsatisfiable
then return AllTup(Q, K)
else
    Q ← ⋃_{qi ∈ Q} PerfectRef(qi, T);
    Q′ ← RewDB(Q);
    Q′′ ← Clean(Q′);
    return ans(Q′′, DB);

Figure 4.5: Algorithm Answer(Q, K)

More precisely, the algorithm takes as input a DL-LiteA KB K = ⟨T, A⟩, represented by means of a TBox T and a database DB, and a union of conjunctive queries Q of arity n, and returns the set of answers ans(Q, K). As already discussed, Answer is very similar to the algorithm presented in [31] for the computation of the certain answers to a query posed over a DL-LiteF KB (or a DL-LiteR KB). Indeed, it differs from the latter only because of the use of the functions (i) RewDB, already introduced and discussed when presenting the algorithm for checking the satisfiability of a DL-LiteA KB, and (ii) Clean, responsible for constraining the answer not to include any soft constant, coherently with Lemma 4.4.1.
More precisely, given a union of conjunctive queries Q, Clean proceeds as follows: for each query q in Q, it adds to the body of q the set of atoms
{ ¬Fresh(si) | si distinguished variable of q }.
Observe that, if K is unsatisfiable, then, as expected, ans(Q, K) is the set of all possible tuples of constants in K whose arity is that of the query. We denote such a set by AllTup(Q, K).
We now show the correctness of the algorithm Answer(Q, K).
Theorem 4.4.6 Let K = ⟨T, A⟩ be a DL-LiteA KB, let Q be a union of conjunctive queries, let U be the set of tuples returned by Answer(Q, K), and let ~t be a tuple of constants in K. Then, ~t ∈ ans(Q, K) iff ~t ∈ U.
Proof. The proof can be straightforwardly adapted from the corresponding one in [31], by observing that PerfectRef computes the union of conjunctive queries that, once reformulated by replacing the DL-LiteA expressions with the corresponding relations in db(A), returns all the answers that would also be returned by can(K). Thus, in order to select among all such tuples those not involving the fresh constants arbitrarily introduced by µ0, we perform an additional selection by means of the function Clean.
As for computational complexity, we get the same bounds as those shown in [31], thus achieving our goal. In particular, we have the following:
Lemma 4.4.7 Let T be a DL-LiteA TBox, and let q be a conjunctive query over T. The algorithm PerfectRef(q, T) terminates and runs in time polynomial in the size of T.
Theorem 4.4.8 Given a DL-LiteA KB K, Answer(Q, K) runs in PTIME in the size of the TBox, and in LOGSPACE in the size of the database used to represent K (data complexity).
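A possible way to picture the overall Answer pipeline, once PerfectRef and RewDB have already been applied, is sketched below in Python: the reformulated union of conjunctive queries is evaluated over the database by a naive join, and Clean is rendered as a filter discarding tuples that mention soft/fresh constants. All data structures and names are assumptions of this sketch, not the thesis's implementation.

from itertools import product

def eval_cq(db, head_vars, body):
    # naive evaluation of one conjunctive query over db (relation name -> set
    # of tuples); arguments are variable names, "_" matches anything
    def extend(binding, atoms):
        if not atoms:
            yield tuple(binding[v] for v in head_vars)
            return
        (pred, args), rest = atoms[0], atoms[1:]
        for tup in db.get(pred, set()):
            if len(tup) != len(args):
                continue
            new = dict(binding)
            for a, value in zip(args, tup):
                if a == "_":
                    continue
                if a in new and new[a] != value:
                    break
                new[a] = value
            else:
                yield from extend(new, rest)
    return set(extend({}, list(body)))

def answer(db, reformulated_ucq, fresh_constants, satisfiable=True, all_constants=()):
    # skeleton of Algorithm Answer (Fig. 4.5), applied after PerfectRef/RewDB
    if not satisfiable:
        arity = len(reformulated_ucq[0][0])
        return set(product(all_constants, repeat=arity))        # AllTup(Q, K)
    answers = set()
    for head_vars, body in reformulated_ucq:
        answers |= eval_cq(db, head_vars, body)
    # Clean: drop answers mentioning soft/fresh constants (cf. Lemma 4.4.1)
    return {t for t in answers if not any(c in fresh_constants for c in t)}

db = {"tempEmp": {("pers1",)}, "employee": {("pers2",), ("soft1",)}}
ucq = [(("x",), [("tempEmp", ("x",))]), (("x",), [("employee", ("x",))])]
print(answer(db, ucq, fresh_constants={"soft1"}))   # {('pers1',), ('pers2',)}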
Chapter 5
Consistency and Query Answering over Ontology-based DIS
In this chapter we investigate the main problems concerning DL-LiteA ontology-based DIS, namely consistency (cf. Section 1.2) and query answering (cf. Section 1.3). To this aim, we start by introducing DL-LiteA ontology-based DIS. Then, we present an overview of our reasoning method, and the core result of this chapter, namely the modularizability of the DL-LiteA reasoning services. Finally, we provide algorithms for DL-LiteA DIS consistency checking and query answering based on such a result.
5.1 DL-LiteA ontology-based DIS
In this section, after discussing the notorious “impedance mismatch” problem between data and DL-LiteA ontology objects, we present the syntax and the semantics of DL-LiteA DIS.
5.1.1 Linking data to DL-LiteA objects
Ontology-based DIS provide the user with an ontology that can be accessed in order to query data actually stored in several, possibly autonomous and heterogeneous, data sources. Since we are interested in ontology-based DIS where the global schema represents the intensional level of a DL ontology, the instances of concepts and roles in the ontology are simply an abstract representation of some real data stored in existing data sources. Therefore, the problem arises of establishing sound mechanisms for linking existing data to the objects that are instances of the concepts and the roles in the ontology.
To present our solution to this problem, we come back to the notion of object identifiers. These are ad hoc identifiers (e.g., constants in logic) denoting objects that are instances of ontology concepts. Clearly, object identifiers are not to be confused with any data item. Moreover, even if sources may in general store both data and object identifiers, the storage of object identifiers implicitly requires some “agreement” among data sources on the form used for representing them. Thus, to face the possible
absence of such an a-priori agreement, by tracing back to the work done in deductive object-oriented databases [56], we consider a domain of object identifiers that is built starting from data values, in particular as (logic) terms over data items. To realize this idea, we define more precisely the alphabets of constants coming into play. Specifically, while ΓV contains data value constants as before, ΓO is built starting from ΓV and a set Λ of function symbols, each one with an associated arity, i.e., the number of arguments of the function. Formally, let ΓV be an alphabet of data values. Then, we call object term an expression f(d1, . . . , dn) such that f ∈ Λ, arity(f) = n, and d1, . . . , dn ∈ ΓV, where n > 0. In other words, object terms are constructed by applying function symbols to data value constants. We then denote by ΓO(Λ, ΓV) the alphabet of object terms built on top of Λ and ΓV.
Thus, we can now define a DL-LiteA KB over the set of data values ΓV and the set of object terms ΓO(Λ, ΓV). Clearly, the syntax and the semantics of DL-LiteA expressions and TBoxes do not need to be modified. Concerning the ABox, since it is now specified by using the alphabet Γ that is the disjoint union of ΓV and ΓO(Λ, ΓV), it consists of a finite set of membership assertions of the form:
A(o),   A(so),   D(d),   D(sv),   P(o, p),   UC(o, d),   UR(o, p, d)
where o and p are object terms in ΓO(Λ, ΓV), so and sv are soft constants in VO and VV, respectively (as before), and d is a constant in ΓV.
To define the semantics of a DL-LiteA ABox as above, we simply define assignments for A and interpretations I = (∆^I, ·^I) as before. It is worth noting, however, that ·^I now assigns a different element of ∆^I_O to every object term in ΓO(Λ, ΓV) (i.e., we enforce the unique name assumption also on object terms). Formally, this means that I is such that:
❒ for all o ∈ ΓO(Λ, ΓV), we have that o^I ∈ ∆^I_O;
❒ for all o, p ∈ ΓO(Λ, ΓV), we have that o ≠ p implies o^I ≠ p^I.
Finally, as for the query language, a conjunctive query over a DL-LiteA KB using object terms is an expression q(~x) ← conj(~x, ~y) such that the atoms in conj(~x, ~y) can have the form:
A(xo),   P(xo, yo),   D(xv),   UC(xo, xv),   or UR(xo, yo, xv)
where A, P, D, UC, and UR are, respectively, an atomic concept, an atomic role, an atomic value-domain, an atomic concept attribute, and an atomic role attribute, xv is a value variable in ~x, and xo, yo may be, besides object variables as for DL-LiteA, also object terms in which value variables appear in place of value constants, called variable object terms. Obviously, unions of conjunctive queries can be defined accordingly. Note that, from the point of view of the semantics, conjunctive queries are interpreted exactly as in the case of DL-LiteA KBs presented in the previous chapter.
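As an illustration of the alphabet ΓO(Λ, ΓV) just introduced, the following small Python sketch represents an object term as a function symbol of Λ applied to a tuple of value constants of ΓV; equality between such terms then captures exactly the unique name assumption enforced above. The class and names are illustrative, not part of the thesis.

from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectTerm:
    function: str            # a symbol of Λ, e.g. "pers"
    arguments: tuple         # value constants of ΓV, e.g. ("20903",)

    def __str__(self):
        return f"{self.function}({', '.join(self.arguments)})"

a = ObjectTerm("pers", ("20903",))
b = ObjectTerm("pers", ("20903",))
c = ObjectTerm("mgr", ("20903",))
print(a == b, a == c)    # True False: same object iff same symbol and arguments
print(a)                 # pers(20903)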
5.1.2 Logical framework for DL-LiteA DIS
Let us now turn our attention to the problem of linking objects in the ontology to the data in the sources. To this end, we assume that data sources are wrapped into
a set of relational sources D. Note that this assumption is indeed realistic, as many data federation tools providing exactly this kind of service are currently available (cf. Section 2.1). In this way, we can assume that all relevant data are virtually represented and managed by a relational data engine, and that we can query data by using SQL. We make the following assumptions:
❒ the set of sources D is independent of the ontology; in other words, our aim is to link the ontology to a collection of data that live autonomously and have not been structured with the purpose of storing the ontology instances;
❒ the set of sources D is characterized in terms of a set of schemas and instances, where each schema is a specification of one relational schema (i.e., the relation name and the collection of its attributes) for one source in D, and each source is formed by a set of tuples;
❒ all value constants stored in the set of sources D belong to ΓV;
❒ ans(ϕ, D) denotes the set of tuples (of the arity of ϕ) of value constants returned as the result of the evaluation of the SQL query ϕ over the set of data sources D.
We are now able to define DL-LiteA ontology-based DIS, according to the logical framework presented in Section 1.1. Given an alphabet of value constants ΓV and an alphabet of function symbols Λ, a DL-LiteA ontology-based data integration system (shortly referred to as a DL-LiteA DIS) is characterized by a triple Π = ⟨G, S, M⟩ such that:
❒ G is a DL-LiteA TBox; note that G is in fact the intensional level of the ontology;
❒ S is a set of relations {S1, . . . , Sn} over ΓV, for n ≥ 1;
❒ M is a set of sound mappings partitioned into two sets, Mt and Ma, where:
– Mt is a set of assertions, called typing assertions, each one of the form
Φ ; Ti
where Φ is a query over D denoting the projection of one relation over one of its attributes, and Ti is one of the DL-LiteA data types;
– Ma is a set of assertions, called mapping assertions, each one of the form
Φ ; Ψ
where Φ is a first-order logic query over D of arity n, and Ψ is a DL-LiteA conjunctive query over G of arity n, without non-distinguished variables, that possibly includes terms in Γ.
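Purely as an illustration of the shape of the two kinds of assertions in M, the following Python sketch renders a typing assertion as a pair (source query, data type) and a mapping assertion as a source query together with the atoms of its head, where an argument is either a value variable or an object-term template over the variables of the source query. The SQL strings and all names are hypothetical.

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class TypingAssertion:
    source_query: str      # Φ: the projection of one source relation on one attribute
    datatype: str          # T_i: a DL-LiteA data type, e.g. "xsd:string"

@dataclass(frozen=True)
class MappingAssertion:
    source_query: str                 # Φ: a first-order (here SQL) query over the sources
    variables: Tuple[str, ...]        # the distinguished variables of Φ
    head_atoms: Tuple[tuple, ...]     # Ψ: atoms over G; ("pers", ("s",)) encodes the
                                      # object term pers(s), a bare "n" a value variable

# Mapping M2 of Example 5.1.1 below, rendered in this (hypothetical) format.
m2 = MappingAssertion(
    source_query="SELECT SSN AS s, NAME AS n FROM S2",
    variables=("s", "n"),
    head_atoms=(("employee", (("pers", ("s",)),)),
                ("persName", (("pers", ("s",)), "n"))),
)
t2 = TypingAssertion(source_query="SELECT SSN FROM S2", datatype="xsd:string")
print(m2.head_atoms[0], t2.datatype)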
We briefly comment on the assertions in M as defined above. Typing assertions are used to assign appropriate types to constant values appearing in the set of data sources. Basically, these assertions are used for interpreting the values stored at the sources in terms of the types used in the ontology. Mapping assertions, on the other hand, are used to map data in the data sources to concepts, roles, and attributes in the ontology. It is worth noting that now that we have object terms, the data layer underlying a DL-LiteA DIS contains only data, whereas object identifiers are virtually built on top of this data. Thus, autonomous data sources can effectively provide their portion of data and contribute to the ontology instance-level, without being required to agree on any particular object identification scheme. We next give an example of DL-LiteA DIS.
Example 5.1.1 Let Λ = {pers, proj, mgr}, where pers, proj, and mgr are function symbols of arity 1. Consider the DL-LiteA DIS Π = ⟨G, S, M⟩ such that:
❒ G is the TBox of Example 3.1.1;
❒ S = {S1 , S2 , S3 , S4 }, with the following signature: S1 [SSN:STRING,PROJ:STRING, D:DATE], S2 [SSN:STRING,NAME:STRING], S3 [C:STRING,NAME:STRING], S4 [C:STRING,SSN:STRING]
❒ M = {Mt , Ma }, where:
– Mt is such that:
∃y, z. S1(x, y, z) ; xsd:string(x)
∃y. S2(x, y) ; xsd:string(x)
∃y. S3(x, y) ; xsd:string(x)
∃y. S4(x, y) ; xsd:string(x)
∃y, z. S1(y, x, z) ; xsd:string(x)
∃y. S2(y, x) ; xsd:string(x)
∃y. S3(y, x) ; xsd:string(x)
∃y. S4(y, x) ; xsd:string(x)
∃y, z. S1(y, z, x) ; xsd:date(x)
– Ma is as follows:
M1:  q_db^1(s, p, d) ← S1(s, p, d)
     ;  q_G^1(s, p, d) ← tempEmp(pers(s)), projName(proj(p), p), until(pers(s), proj(p), d)
M2:  q_db^2(s, n) ← S2(s, n)
     ;  q_G^2(s, n) ← employee(pers(s)), persName(pers(s), n)
M3:  q_db^3(s, n) ← ∃c. S3(c, n) ∧ S4(c, s)
     ;  q_G^3(s, n) ← manager(pers(s)), persName(pers(s), n)
M4:  q_db^4(c, n) ← S3(c, n) ∧ ¬∃s. S4(c, s)
     ;  q_G^4(c, n) ← manager(mgr(c)), persName(mgr(c), n)
Let D be a set of sources {D1, D2, D3, D4} conforming to S. D1 stores tuples (s, p, d), where s and p are strings and d is a date, such that s is the social security number of a temporary employee, p is the name of the project s/he works for (different projects have different names), and d is the ending date of the employment. D2 stores tuples (s, n) of strings consisting of the social security number s of an employee and her/his name n. D3 stores tuples (c, n) of strings consisting of the code c of a manager and her/his name n. Finally, D4 relates managers' codes with their social security numbers.
Thus, intuitively, the typing assertions in Mt establish how to map SQL source datatypes to the RDF data types of the values occurring in the ontology. Concerning the mapping specification Ma, M1 captures the fact that every tuple (s, p, d) in D1 corresponds to a temporary employee pers(s), working until d for a project proj(p) whose name is p. M2 extracts employees pers(s) and their names n. M3 and M4 tell us how to extract from D3 information about managers and their names. When we extract such information, if we can make use of D4, which provides us with the social security number of managers (identified by a code in D3), then we use object terms of the form pers(s); if such information is not available in D4, then we use object terms of the form mgr(c).
In order to define the semantics of a DL-LiteA DIS, we need to define when an interpretation satisfies a mapping w.r.t. a set of data sources D. Thus, let D = {D1, . . . , Ds} be a set of data sources such that Dj conforms to Sj, for each Sj ∈ S. According to the usual semantics of sound mappings (cf. Section 1.1), we say that I satisfies M : Φ ; Ψ w.r.t. D if, for each tuple of values ~t in ΓV, if ~t ∈ ans(Φ, D), then we have that ~t^I ∈ Ψ^I, where, as usual, ans(Φ, D) denotes the set of answers to the query Φ posed over the set of sources D. Thus, we can now give the semantics of a DL-LiteA DIS. Let D be a set of data sources conforming to S. An interpretation I = (∆^I, ·^I) is a model of Π w.r.t. D if and only if:
❒ I is a model of G;
❒ I satisfies all mapping assertions in M w.r.t. D.
As usual, we say that a DL-LiteA DIS is consistent w.r.t. D if there exists a model of Π w.r.t. D.
Example 5.1.2 Refer to the DL-LiteA DIS Π of Example 5.1.1. A possible set of data sources conforming to S is the following:
D1 = {(20903, Tones, 25-09-05)}
D2 = {(20903, Palmieri), (55577, Parker)}
D3 = {(Lenz, Lenzerini), (Abit, Abiteboul)}
D4 = {(Lenz, 29767)}
One can easily verify that Π is consistent w.r.t. D.
Let Q denote a union of conjunctive queries over Π of arity n. As usual in DIS, Q is expressed in terms of the global schema G. Moreover, we call certain answers to Q posed over Π w.r.t. D the set of n-tuples of constants in Γ(ΓV, Λ) ∪ ΓV, denoted Q(Π, D), that is defined as follows:
Q(Π, D) = { ~t | ~t^I ∈ Q^I for every I ∈ sem(Π, D) }
5.2 Overview of consistency and query answering method
In this section, we present an overview of our solution for checking DL-LiteA DIS consistency and solving query answering. The simplest way to tackle these problems over a DL-LiteA DIS is to use the mappings to produce an actual ABox, and then to reason on the ontology constituted by this ABox and the original TBox, applying the techniques described in Chapter 4. We call such an approach “bottom-up”. The bottom-up approach involves a duplication of the data in the database so as to populate the new ABox, and this is clearly unacceptable in several circumstances. We therefore propose an alternative approach, called “top-down”, that avoids such a duplication, essentially by keeping the ABox virtual. We sketch out the main ideas of both approaches below, by first presenting the notion of virtual ABox. Then, we provide preliminary basic notions of logic programming upon which the technical development of the next section is built.
5.2.1 The notion of virtual ABox
Definition 5.2.1 Let Π = ⟨G, S, M⟩ be a DL-LiteA DIS, let D be a set of sources conforming to S, and let M be a mapping assertion in M of the form M = Φ ; Ψ. We call virtual ABox generated by M from D the set of assertions, denoted A(M, D), such that
A(M, D) = { Ψ[~x/~t] | ~t ∈ ans(Φ, D) },
where ~t, Φ, and Ψ are of arity n, and Ψ[~x/~t] denotes the formula obtained from Ψ(~x) by substituting the n-tuple of variables ~x with the n-tuple of constants ~t ∈ ΓV^n. Moreover, we call virtual ABox for Π (from D) the set of assertions A(M, D) obtained as the union of the sets A(M, D) for all M ∈ M.
Notice that A(M, D) is an ABox over ΓO(Λ, ΓV) and ΓV, as shown in the following example.
Example 5.2.2 Let Π = ⟨G, S, M⟩ be the DL-LiteA DIS of Example 5.1.1. Consider in particular the mapping M2:
M2:  q_db^2(s, n) ← S2(s, n)
     ;  q_G^2(s, n) ← employee(pers(s)), persName(pers(s), n)
Then, given the set of data sources D of Example 5.1.2, we have ans(q_db^2, D) = {(20903, Palmieri), (55577, Parker)}, and A(M2, D) is as follows:
employee(pers(20903))               (5.1)
persName(pers(20903), Palmieri)     (5.2)
employee(pers(55577))               (5.3)
persName(pers(55577), Parker)       (5.4)
By proceeding in the same way for each mapping assertion in M, we can easily obtain the virtual ABox of Π.
Virtual ABoxes allow for expressing the semantics of a DL-LiteA DIS in terms of the semantics of DL-LiteA ontologies, as follows:
Proposition 5.2.3 Let Π = ⟨G, S, M⟩ be a DL-LiteA DIS, let D be a set of data sources conforming to S, and let A(M, D) be the virtual ABox for Π from D. Then we have that sem(Π, D) = { I | I |= G ∧ I |= A(M, D) } = Mod(K), where K = ⟨G, A(M, D)⟩.
Proof. Trivial, from the definitions.
Now that we have introduced virtual ABoxes, we start by discussing the bottom-up approach.
5.2.2 A “naive” bottom-up approach
The proposition above suggests an obvious bottom-up algorithm to solve consistency and query answering over a DL-LiteA DIS Π = ⟨G, S, M⟩, which we describe next. First, given a set D of data sources conforming to S, we materialize the virtual ABox for Π from D. Second, we apply to the DL-LiteA KB K = ⟨G, A(M, D)⟩ the algorithms for checking DL-LiteA KB satisfiability and query answering described in Chapter 4. This way of proceeding is sufficient to solve satisfiability, whereas for query answering over a DL-LiteA DIS we further need to carefully take into account the possible presence of variable object terms in the query. Intuitively, this requires proceeding as follows. Given a union of conjunctive queries Q over a DL-LiteA DIS, we first substitute each distinct variable object term in Q with a new object variable, thus obtaining a query Q′ which contains only object variables, object constants, value
variables, and value constants. Therefore, we can process Q′ exactly as discussed in the previous chapter. As a result, we obtain a set of tuples whose elements are data values in ΓV. Finally, by post-processing the answers so as to reconstruct object terms starting from values, we obtain the certain answers to Q.
Unfortunately, the approach described above for solving both DL-LiteA DIS consistency and query answering has the following drawbacks. First, the algorithm is PTIME in the size of the database, since the generation of the virtual ABox is by itself PTIME. Second, since the database is independent of the ontology, the data it contains may be modified; this would clearly require setting up a mechanism for keeping the virtual ABox up to date with respect to the evolution of the database. Thus, we next propose a different approach (called “top-down”), which uses an algorithm that avoids materializing the virtual ABox and instead takes the mapping specification into account on-the-fly, during reasoning. In this way, we can both keep the computational complexity of the algorithm low, which turns out to be the same as that of the query answering algorithm for DL-LiteA KBs, i.e., LOGSPACE in data complexity, and avoid any further procedure for data refreshment.
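To make the bottom-up baseline concrete before turning to the top-down method, the following Python sketch materialises a virtual ABox from a set of (split-style) mappings: each mapping is given as a callable evaluating its source query, the variables of that query, and the head atoms to instantiate. The encoding (tuples for atoms, ("pers", ("s",)) for an object-term template) is an assumption of this sketch, not the thesis's implementation.

def instantiate(term, binding):
    # a term is either a variable name (a value) or (f, vars) for an object term
    if isinstance(term, tuple):
        f, variables = term
        return (f, tuple(binding[v] for v in variables))
    return binding[term]

def virtual_abox(mappings, sources):
    abox = set()
    for source_query, variables, head_atoms in mappings:
        for row in source_query(sources):
            binding = dict(zip(variables, row))
            for predicate, args in head_atoms:
                abox.add((predicate, tuple(instantiate(a, binding) for a in args)))
    return abox

# Mapping M2 of Example 5.1.1, with its source query stubbed by a lambda.
sources = {"S2": {("20903", "Palmieri"), ("55577", "Parker")}}
m2 = (lambda d: d["S2"],                       # q_db^2(s, n) <- S2(s, n)
      ("s", "n"),
      (("employee", (("pers", ("s",)),)),      # employee(pers(s))
       ("persName", (("pers", ("s",)), "n")))) # persName(pers(s), n)

for assertion in sorted(virtual_abox([m2], sources)):
    print(assertion)

Running this on mapping M2 with the data of Example 5.1.2 yields (an encoding of) exactly the assertions (5.1)-(5.4) above.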
5.2.3 A top-down approach
We now sketch out the main steps of the top-down approach. First, we rely on the property of DL-LiteA that allows us to reduce KB satisfiability and query answering to the evaluation of a first-order logic query Q over the ABox seen as a database. Since DL-LiteA DIS are not defined in terms of a TBox and an ABox, but rather are specified in terms of a TBox, a set of mappings, and a set of data sources, the evaluation of Q cannot be performed over an ABox (unless we accept materializing the virtual ABox as described in the previous section). Thus, the idea is to further reformulate Q, by taking into account the mapping assertions, so as to produce a query that can be posed directly to the set of data sources. Specifically, we start by reducing the mapping assertions to a “split” form, so that they can be seen as a part that extracts the relevant data from the database and a part that specifies how the object terms of the ontology are built from such data. For the latter we use logic programming technology, which tells us how to perform unifications and generate the right object terms required by the query Q. Then, making use of the first part of the split, we can formulate a new query over the database that tells us how to instantiate the variables of the original query with actual data. Observe that, in this way, data are accessed only at the very end, namely at the moment of evaluating the new reformulated query over the set of data sources, and that the evaluation of such a query can be completely delegated to the DBMS that manages the database.
Example 5.2.4 Consider again the DL-LiteA DIS Π of Example 5.1.1 and the query q discussed in Example 4.4.4 together with its reformulation Qp = PerfectRef(q, G).
By splitting the mappings in M we obtain the following portion of logic program:
tempEmp(pers(s)) ← Aux1(s, p, d)
projName(proj(p), p) ← Aux1(s, p, d)
until(pers(s), proj(p), d) ← Aux1(s, p, d)
employee(pers(s)) ← Aux2(s, n)
persName(pers(s), n) ← Aux2(s, n)
manager(pers(s)) ← Aux3(s, n)
persName(pers(s), n) ← Aux3(s, n)
manager(mgr(c)) ← Aux4(c, n)
persName(mgr(c), n) ← Aux4(c, n)
where Auxk is a predicate denoting the result of evaluating, over a set of data sources D conforming to S, the query Φk occurring in the left-hand side of the mapping Mk. Finally, we unfold each atom of the obtained query Qp by unifying it in all possible ways with the heads of the clauses above (i.e., with the atoms coming from the right-hand sides of the split mapping assertions, written in logic programming syntax), and we obtain the following union of results:
q^Π = {pers(s) | (s, p, d) ∈ ans(q_db^1, D)} ∪
      {pers(s) | (s, n) ∈ ans(q_db^2, D)} ∪
      {pers(s) | (s, n) ∈ ans(q_db^3, D)} ∪
      {mgr(c) | (c, n) ∈ ans(q_db^4, D)}
Note that the whole approach relies crucially on standard notions of logic programming, which we briefly introduce in the next section.
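Before that, the following Python sketch illustrates the unfolding idea of Example 5.2.4 in the simplest case, namely for the unary atoms of the reformulated query (the binary and ternary atoms of Qp are omitted for brevity). Each split mapping contributes a clause whose head is an ontology atom over an object-term template and whose body is the auxiliary predicate Auxk; the extensions of the Auxk predicates below correspond to evaluating the source queries of Example 5.1.1 over the data of Example 5.1.2. The encoding and helper names are assumptions of this sketch.

def certain_workers(union_of_atoms, clauses, aux_extensions):
    # for each unary query atom, unify it with the matching clause heads and
    # build the object terms from the tuples stored in the Auxk extensions
    answers = set()
    for predicate in union_of_atoms:
        for (head_pred, term_template), aux_name, aux_vars in clauses:
            if head_pred != predicate:
                continue
            f, term_vars = term_template       # e.g. ("pers", ("s",))
            for row in aux_extensions.get(aux_name, set()):
                binding = dict(zip(aux_vars, row))
                answers.add((f, tuple(binding[v] for v in term_vars)))
    return answers

# Clauses for the unary predicates of Example 5.2.4, projected on their single
# object argument; Auxk holds the result of evaluating Φk over D.
clauses = [
    (("tempEmp",  ("pers", ("s",))), "Aux1", ("s", "p", "d")),
    (("employee", ("pers", ("s",))), "Aux2", ("s", "n")),
    (("manager",  ("pers", ("s",))), "Aux3", ("s", "n")),
    (("manager",  ("mgr",  ("c",))), "Aux4", ("c", "n")),
]
aux_extensions = {
    "Aux1": {("20903", "Tones", "25-09-05")},
    "Aux2": {("20903", "Palmieri"), ("55577", "Parker")},
    "Aux3": {("29767", "Lenzerini")},
    "Aux4": {("Abit", "Abiteboul")},
}
for term in sorted(certain_workers({"tempEmp", "employee", "manager"},
                                   clauses, aux_extensions)):
    print(term)
# ('mgr', ('Abit',)), ('pers', ('20903',)), ('pers', ('29767',)), ('pers', ('55577',))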
5.2.4 Relevant notions from logic programming
We next briefly recall some basic notions of logic programming [70] and of Partial Evaluation [61], upon which we build the top-down approach. In particular, we exploit some crucial results on Partial Evaluation given in [71], which we recall below.
Definition 5.2.5 A definite program clause is an expression of the form
A ← W
where A is an atom and W is a conjunction of atoms A1, . . . , An. We call head of the clause its left-hand side, which contains A, and body of the clause its right-hand side, which contains W. Either the body or the head of the clause may be empty. When the body is empty, the clause is called a fact (and the ← symbol is in general omitted). When the head is empty, the clause is called a definite goal. A definite program is a finite set of definite program clauses. Notice that A ← W also has a first-order logic interpretation, which is the following:
∀x1, . . . , xs (A ∨ ¬W),
where x1, . . . , xs are all the variables occurring in W and A.
From now on, when we talk about programs, program clauses, and goals, we implicitly mean definite programs, definite program clauses, and definite goals, respectively. From a well-known result of logic programming we have the following crucial property of definite programs:
Proposition 5.2.6 (Program minimal model) For each program P, the intersection MP of all Herbrand models of P is a model of P, called the minimal model of P.
We say that an atom containing no variables is true in a logic program P if it is true in the minimal model of P.
Definition 5.2.7 Let G be the goal ← A1, . . . , Am, . . . , Ak and C be a program clause A ← B1, . . . , Bq. Then G′ is derived from G and C using the most general unifier (mgu) θ if the following conditions hold:
❒ Am is an atom, called the selected atom, in G,
❒ θ is an mgu of Am and A, and
❒ G′ is the goal ← (A1, . . . , Am−1, B1, . . . , Bq, Am+1, . . . , Ak)θ,
where (A1, . . . , An)θ = A1θ, . . . , Anθ and Aθ is the atom obtained from A by applying the substitution θ.
Definition 5.2.8 A resultant is an expression of the form Q1 ← Q2 where each Qi (i = 1, 2) is either absent or a conjunction of literals. All variables in Q1 and Q2 are assumed to be universally quantified.
Definition 5.2.9 Let P be a program and let G be a goal. Then, a (partial) SLD-Tree of P ∪ {G} (this definition of SLD-Tree comes from [71]) is a tree satisfying the following:
❒ each node of the tree is a resultant,
❒ the root node is Gθ0 ← G0, where Gθ0 = G0 = G (i.e., θ0 is the empty substitution),
❒ let Gθ0 · · · θi ← Gi be a node at depth i ≥ 0 such that Gi has the form A1, . . . , Am, . . . , Ak, and suppose that Am is the selected atom. Then, for each input clause A ← B1, . . . , Bq such that Am and A are unifiable with mgu θi+1, the node has a child Gθ0θ1 · · · θi+1 ← Gi+1, where Gi+1 is derived from Gi and the input clause by using θi+1, i.e., Gi+1 has the form (A1, . . . , B1, . . . , Bq, . . . , Ak)θi+1,
❒ nodes which are the empty clause have no children.
Given a branch of the tree, we say that it is a failing branch if it ends in a node such that the selected atom does not unify with the head of any program clause. Moreover, we say that a SLD-Tree is complete if all of its non-failing branches end in the empty clause. Finally, given a node Qθ ← Qi at depth i, we say that the derivation of Qi has length i with computed answer θ, where θ is the restriction of θ0, . . . , θi to the variables in G.
Now, we state the definition of partial evaluation (PE for short) from [71]. Note that the definition refers to two kinds of PE: the PE of an atom in a program, and the PE of a program w.r.t. an atom.
Definition 5.2.10 Let P be a program, A an atom, and T a SLD-tree for P ∪ {← A}. Let G1, . . . , Gr be a set of (non-root) goals in T such that each non-failed branch of T contains exactly one of them. Let Ri (i = 1, . . . , r) be the resultant of the derivation from ← A down to Gi associated with the branch leading to Gi.
❒ The set of resultants π = {R1, . . . , Rr} is a PE of A in P. These resultants have the following form: Ri = Aθi ← Qi (i = 1, . . . , r), where we assume Gi = ← Qi.
❒ Let P′ be the program obtained from P by replacing the set of clauses of P whose head contains A with a PE π of A in P. Then P′ is a PE of P w.r.t. A.
Intuitively, to obtain a PE of an atom A in P we consider a SLD-tree T for P ∪ {← A} and choose a cut in T. The PE is defined as the union of the resultants that occur in the cut and do not fail in T.
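The two central operations in these definitions, computing a most general unifier and performing one derivation step, can be illustrated by the following compact Python sketch, restricted to flat atoms whose arguments are variables (strings starting with an upper-case letter, a convention of this sketch only) or constants; it is meant only to make Definitions 5.2.7-5.2.10 more tangible and is not part of the thesis.

def is_var(term):
    # convention of this sketch: variables start with an upper-case letter
    return isinstance(term, str) and term[:1].isupper()

def walk(term, subst):
    while is_var(term) and term in subst:
        term = subst[term]
    return term

def unify(atom1, atom2):
    # most general unifier of two flat atoms, or None if they do not unify
    (p1, args1), (p2, args2) = atom1, atom2
    if p1 != p2 or len(args1) != len(args2):
        return None
    subst = {}
    for a, b in zip(args1, args2):
        a, b = walk(a, subst), walk(b, subst)
        if a == b:
            continue
        if is_var(a):
            subst[a] = b
        elif is_var(b):
            subst[b] = a
        else:
            return None
    return subst

def apply_subst(atom, subst):
    predicate, args = atom
    return (predicate, tuple(walk(a, subst) for a in args))

def derive(goal, selected_index, clause):
    # one SLD step (Definition 5.2.7): resolve the selected atom with the clause
    head, body = clause
    theta = unify(goal[selected_index], head)
    if theta is None:
        return None
    new_goal = goal[:selected_index] + body + goal[selected_index + 1:]
    return tuple(apply_subst(a, theta) for a in new_goal), theta

# the goal <- ancestor(X, bob) resolved with ancestor(Y, Z) <- parent(Y, Z)
goal = (("ancestor", ("X", "bob")),)
clause = (("ancestor", ("Y", "Z")), (("parent", ("Y", "Z")),))
print(derive(goal, 0, clause))
# ((('parent', ('Y', 'bob')),), {'X': 'Y', 'Z': 'bob'})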
5.3 DL-LiteA DIS consistency and query answering
In this section, we present the core result of this chapter, namely the modularizability of the DL-LiteA DIS consistency and query answering services. Then, we provide algorithms for DL-LiteA DIS consistency and query answering based on such a result. Finally, we discuss computational complexity issues.
5.3.1 Modularizability
In order to introduce the modularizability of DL-LiteA reasoning services, according to the top-down approach discussed in the previous section, we need to present the notion of split version of a DL-LiteA DIS. Such a notion characterizes DL-LiteA DIS whose mappings have a particularly “friendly” form. Specifically, given a DL-LiteA DIS Π = ⟨G, S, M⟩, we compute the split version of Π, denoted Split(Π) = ⟨G′, S, M′⟩, by setting G′ = G and by constructing M′ as follows: for each mapping assertion Φ ; Ψ ∈ M and for each atom p ∈ Ψ, we add the assertion Φ ; p to M′. Luckily, we have the following.
Proposition 5.3.1 Let Π = ⟨G, S, M⟩ be a DL-LiteA DIS and D a set of data sources conforming to S. Then, we have that sem(Split(Π), D) = sem(Π, D).
Proof. The result follows straightforwardly from the form of the mappings and from Proposition 5.2.3.
Thus, given an arbitrary DL-LiteA DIS, we can always reduce it to its split version. Moreover, such a reduction is PTIME in the size of the mappings and does not depend on the size of the data. This allows us to assume, from now on, that we deal only with split versions of DL-LiteA DIS.
In what follows, we use the definitions given in the previous section to present the modularizability of reasoning in DL-LiteA. In particular, the goal is to define a function RewDB that, intuitively, takes as input a union of conjunctive queries (possibly with inequalities) Q over the virtual ABox for Π from a set of data sources D conforming to S, and returns a set of resultants describing (i) the queries to pose over D, and (ii) the substitution to apply to the result in order to obtain the answer to Q. We start by defining the notion of program for a query.
Definition 5.3.2 Let Π = ⟨G, S, M⟩ be a DL-LiteA DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries over db(A(M, D)), possibly including inequalities. Then, we call program for Q, denoted P(Q), the logic program having the following form:
P(Q) = { ans ← q^g | q^g = σ(q), q ∈ Q }
     ∪ { p_k(f(~x)) ← Auxk(~x) | Φk(~x) ; p_k(f(~x)) ∈ M }
     ∪ { Auxk(~t) | ~t ∈ ans(Φk, D), k = 1, . . . , m }
     ∪ { Distinct(v1, v2) | v1, v2 ∈ ΓV, v1 ≠ v2 }
     ∪ { Distinct(f1(v1), f2(v2)) | v1, v2 ∈ ΓV, f1, f2 ∈ Λ, f1 ≠ f2 }
where m is the number of mappings in M, and
❒ for each k ∈ {1, . . . , m}, Auxk is an auxiliary predicate whose extension coincides with the set of tuples in ans(Φk, D);
❒ Distinct is an auxiliary predicate whose extension coincides with the set of pairs of distinct terms in ΓO(Λ, ΓV) and of distinct constants in D;
❒ q^g = σ(q) denotes the conjunction of atoms obtained by replacing each inequality x ≠ y in the body of a query q in the union Q with the atom Distinct(x, y);
❒ ans is an auxiliary predicate having the same arity as Q.
Below, we denote by θ_~t the substitution of the variables in ans with the terms in ~t.
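The construction of Definition 5.3.2 can be pictured with the following Python sketch, which builds (a simplified encoding of) the clauses of P(Q) from the split mappings and the data: one clause per disjunct of Q, one clause per split mapping, and one Auxk fact per answer of Φk over D. The Distinct facts are omitted, and the representation (tuples for atoms, a callable for each Φk) is an assumption of this sketch.

def program_for_query(query_disjuncts, split_mappings, sources):
    clauses = []
    # ans <- q^g, one clause per disjunct of Q
    for head_vars, body in query_disjuncts:
        clauses.append((("ans", head_vars), tuple(body)))
    for k, (evaluate_phi, variables, head_atom) in enumerate(split_mappings, start=1):
        aux = ("Aux%d" % k, variables)
        # p_k(f(x)) <- Aux_k(x)
        clauses.append((head_atom, (aux,)))
        # Aux_k(t) for every t in ans(Phi_k, D)
        for row in evaluate_phi(sources):
            clauses.append((("Aux%d" % k, row), ()))
    return clauses

# Split form of mapping M2 of Example 5.1.1 (two separate heads).
sources = {"S2": {("20903", "Palmieri")}}
split_mappings = [
    (lambda d: d["S2"], ("s", "n"), ("employee", (("pers", ("s",)),))),
    (lambda d: d["S2"], ("s", "n"), ("persName", (("pers", ("s",)), "n"))),
]
query = [(("x",), [("employee", ("x",))])]   # q(x) <- employee(x)
for head, body in program_for_query(query, split_mappings, sources):
    print(head, "<-", body)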
The following lemma states a notable property of the programs defined above.
Lemma 5.3.3 Let Π = ⟨G, S, M⟩ be a DL-LiteA DIS, D a set of data sources conforming to S, Q a union of conjunctive queries over db(A(M, D)), possibly including inequalities, and P(Q) the program for Q. Then, db(A(M, D)) coincides with the projection over G of the minimal model of P(Q).
Proof. To prove the lemma, we first show that, for each n-tuple of object terms ~t ∈ Γ(ΓV, Λ)^n, if ~t belongs to X in db(A(M, D)), then X(~t) is true in the minimal model MP of P(Q). Consider a tuple ~t of object terms that belongs to X in db(A(M, D)), i.e., such that X(~t) ∈ A(M, D). Thus, by construction of A(M, D), there exist a mapping Φk(~x) ; X(f(~x)) in M, a tuple ~t′ of values in ΓV, and a substitution θ = {~x/~t′} such that ~t′ ∈ ans(Φk, D) and ~t = f(~x)θ = f(~t′). But then, since ~t′ ∈ ans(Φk, D), we have that Auxk(~t′) ∈ P(Q). Moreover, since Φk(~x) ; X(f(~x)) is a mapping in M, we have that X(f(~x)) ← Auxk(~x) belongs to P(Q). Thus, θ is a mgu of Auxk(~x) and Auxk(~t′). Therefore, it is possible to derive X(~t) from X(f(~x)) ← Auxk(~x) and Auxk(~t′) by using θ, which proves that X(~t) is true in MP. Conversely, let X(~t) be true in the minimal model MP of P(Q), where X is an expression of G. By following a similar line of reasoning as above, one can show that ~t belongs to X in db(A(M, D)).
Corollary 5.3.4 Let Π = ⟨G, S, M⟩ be a DL-LiteA DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries over db(A(M, D)), possibly with inequalities. Moreover, let P(Q) be the program for Q. Then, for each tuple ~t of constants in Γ(ΓV, Λ) ∪ ΓV, we have that ~t ∈ ans(Q, db(A(M, D))) if and only if P(Q) ∪ {← ansθ_~t} is unsatisfiable.
Proof. The result follows directly from the previous lemma and from the construction of P(Q).
Given a union of conjunctive queries (possibly with inequalities) Q over db(A(M, D)), let SLD-Derive(P(Q)) be the function that takes as input the program P(Q) and returns a set of resultants, as follows.
❒ First, it constructs a SLD-Tree T for P(Q) ∪ {← ans} as follows:
– it starts by selecting the atom ans, and
– then it continues by selecting the atoms that belong to the alphabet of G, as long as there are any.
❒ Second, it returns the set S of the leaves ansθj ← qj of T that do not belong to any failing branch of T.
Note that SLD-Derive can use any procedure to compute the SLD-Tree for P(Q) ∪ {← ans}, provided that the computation rule follows the requirements above. We have the following.
Lemma 5.3.5 Let Π = ⟨G, S, M⟩ be a DL-LiteA DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries (possibly with inequalities) over db(A(M, D)). Moreover, let SLD-Derive(P(Q)) ≠ ∅. Then, SLD-Derive(P(Q)) is a PE of {← ans} w.r.t. P(Q).
Proof. Trivial, by construction and by the definition of PE.
Consider now the partial evaluation of P(Q) w.r.t. {← ans} obtained from the set of resultants S; in what follows, we denote it by P(Q, S). We then have the following.
Lemma 5.3.6 Let Π = ⟨G, S, M⟩ be a DL-LiteA DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries (possibly with inequalities) over db(A(M, D)). Moreover, let S = SLD-Derive(P(Q)) ≠ ∅. Then, for each atom A, A is true in P(Q, S) if and only if A is true in P(Q).
Proof. The proof follows from a well-known result of [71], stating that if P′ is a PE of a program P w.r.t. {G}, then P and P′ are procedurally equivalent, i.e., for each atom A, A belongs to the minimal model of P if and only if A belongs to the minimal model of P′.
Let S be a (non-empty) PE of P(Q) w.r.t. {← ans} and let Q = ansθ ← q be a resultant in S. We define unfold_Π(S, Q) as the function that returns an (extended form of) resultant Q′ = ansθ ← q′ such that q′ is a first-order query over D, obtained from q by proceeding as follows. At the beginning, q′ has an empty body. Then, for each atom A in q:
❒ if A = Auxk(~x), we add to q′ the query Φk(~x); note that, by hypothesis, Φk(~x) is an arbitrary first-order query with distinguished variables ~x, which can be evaluated over D;
❒ if A = Distinct(x1, x2), where x1, x2 have respectively the form f1(~y1) and f2(~y2), then:
– if f1 ≠ f2, we do not add any conjunct to q′;
– otherwise, we add the conjunct
⋁_{i ∈ {1,...,w}} y1i ≠ y2i,
where w is the arity of f1. Here again, note that we obtain a disjunction of inequalities between variables, which can obviously be evaluated over a set of data sources D.
Lemma 5.3.7 Let Π = ⟨G, S, M⟩ be a DL-LiteA DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries (possibly with inequalities) over db(A(M, D)). Moreover, let S = SLD-Derive(P(Q)) ≠ ∅. Then, for each Q = ansθ ← q ∈ S and for each tuple of constants ~t′ in ΓV, we have that ansθ_~t is true in P(Q, S), where ~t = ~t′θ, if and only if ~t′ ∈ ans(q′, D), where unfold_Π(S, Q) = (ansθ ← q′).
Proof. Let Q = ansθ ← q be a resultant in S such that q has the form A1 (~x1 ), · · · , An (~xn ). By construction, Ai (~xi ) may either have the form:
❒ Auxki (~xi );
❒ or Distinct(~xi ), where ~xi = (xi1 , xi2 ).
Suppose that Ai has predicate Auxki for each i ≤ j, whereas it has predicate Distinct for j < i ≤ n. Consider now Q′ = unfold Π (S, Q). By construction, Q′ = ansθ ← q ′ where q ′ has the form:

{ ~x, · · · , yi11 , yi21 , · · · , yi1wi , yi2wi , · · · | Φk1 (~x1 ), · · · , Φkj (~xj ), ⋀_{j<i≤n} ( ⋁_{h∈{1,...,wi}} yi1h ≠ yi2h ) }

where ( ⋁_{h∈{1,...,wi}} yi1h ≠ yi2h ) occurs, together with the corresponding distinguished variables yi1h , yi2h , if there is an atom Distinct(xi1 , xi2 ) in q such that xi1 = f (~yi1 ), xi2 = f (~yi2 ), where f has arity wi . Now, let ~t be a tuple of constants in ΓV . We show next that if q(~t) is true in P(Q, S), then ~t′ ∈ ans(q ′ , D) where ~t = ~t′ θ. Suppose that q(~t) is true in P(Q, S). Then there exists θq such that q(~t) = (A1 (~x1 ), · · · , An (~xn ))θq is true in P(Q, S). This implies that there exist n facts Fi in P(Q, S) such that Fi = Ai θq is true in P(Q, S) for each i = 1, · · · , n. But then, by construction:
❒ if i ≤ j, then Fi has the form Auxki (~ti ), which by construction means that ~ti ∈ ans(Φki , D);
❒ otherwise, Fi has the form Distinct(~ti ), where ~ti = (ti1 , ti2 ) and ti1 , ti2 are terms in Γ(ΓV , Λ) such that ti1 ≠ ti2 , i.e. ti1 = f1 (~vi1 ), ti2 = f2 (~vi2 ) where either f1 , f2 ∈ Λ with f1 ≠ f2 , or ~vi1 ≠ ~vi2 .
Then, one can easily verify that ~t′ ∈ ans(q ′ , D). Indeed, for i ≤ j we have trivially that Φki (~t′i ) is true, whereas for j < i ≤ n, we have that if f1 = f2 , then vi1h ≠ vi2h for some h ∈ {1, . . . , wi }, where wi is the arity of f1 . Thus, since ansθq belongs to P(Q, S), then ~t = ~t′ θ, and we have proved the claim. Clearly, the converse of the lemma can be proved by following the same line of reasoning.

Before presenting RewDB, we need to introduce one more notion, i.e. the notion of compilation for Q. Given a union of conjunctive queries Q, we call compilation for Q, denoted C(Q), the program obtained from P(Q) by eliminating all facts Auxk (~t) and Distinct(~t). We then have the following.

Lemma 5.3.8 Let Π = hG, S, Mi be a DL-LiteA DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries (possibly with inequalities) over db(A(M, D)). Then, we have that SLD-Derive(P(Q)) = SLD-Derive(C(Q)).
Proof. The proof follows from the observation that SLD-Derive(P(Q)) constructs a SLD-Tree for P(Q) ∪ {← ans} by selecting only the atoms in the alphabet of G, and that P(Q) and C(Q) coincide in the clauses containing atoms in the alphabet of G. Now we are finally able to come back to the definition of RewDB. Let Π = hG, S, Mi be a DL-LiteA DIS, D a set of data sources conforming to S, and Q a
Algorithm RewDB(Q, Π, D)
Input: DL-LiteA DIS Π = hG, S, Mi, set of data sources D conforming to S,
union of conjunctive queries (possibly with inequalities) Q over db(A(M, D))
Output: set of resultants S ′
build the program C(Q);
compute the set of resultants S = SLD-Derive(C(Q));
S ′ ← ∅;
for each ansθ ← q ∈ S do S ′ ← S ′ ∪ {unfold Π (S, ansθ ← q)};
return S ′

Figure 5.1: The Algorithm RewDB

union of conjunctive queries (possibly with inequalities) over db(A(M, D)). We define RewDB(Q, Π, D) as the function that takes as input Q, Π, and D, and returns a set S ′ of resultants by proceeding as shown in Fig. 5.1. We now show the correctness of RewDB.

Theorem 5.3.9 Let Π = hG, S, Mi be a DL-LiteA DIS, D a set of data sources conforming to S, and Q a union of conjunctive queries over db(A(M, D)). Then, RewDB(Q, Π, D) terminates. Moreover, let S ′ = RewDB(Q, Π, D). Then, for each tuple of constants ~t in Γ(ΓV , Λ) ∪ ΓV and for each tuple of constants ~t′ ∈ ΓV : ~t ∈ ans(Q, db(A(M, D))) if and only if ∃(ansθ ← q ′ ) ∈ S ′ such that ~t = ~t′ θ and ~t′ ∈ ans(q ′ , D).
Proof. Concerning termination, it is clear that RewDB always terminates since, by construction, all its steps terminate. Let us now focus on soundness and completeness of RewDB(Q, Π, D). Specifically, suppose first that RewDB(Q, Π, D) = ∅. Then, by construction, SLD-Derive(C(Q)) = ∅. By Lemma 5.3.8, we also have that SLD-Derive(P(Q)) = ∅. Thus, the SLD-Tree for P(Q) ∪ {← ans} contains only failing branches, which implies that P(Q) ∪ {¬ans} is satisfiable. Therefore, by Corollary 5.3.4, we have that there exists no ~t such that ~t ∈ ans(Q, db(A(M, D))), which proves that the theorem holds. Suppose now that RewDB(Q, Π, D) = S ′ 6= ∅. Let Q′ = ansθ ← q ′ be a resultant in S ′ . Then, by construction, we have that there exists Q = ansθ ← q in S = SLD-Derive(C(Q)) such that q is a conjunctive query in Q and unfold (S, Q) = Q′ . Thus, since SLD-Derive(C(Q)) = SLD-Derive(P(Q)), by Lemma 5.3.7, we have that for each tuple of constants ~t′ in ΓV and for each tuple of constants ~t in Γ(ΓV , Λ) ∪ ΓV : ~t = ~t′ θ ∧ ~t′ ∈ ans(q ′ , D) ⇐⇒ ansθ~t is true in P(Q, S). Let ~t be a tuple of constants in Γ(ΓV , Λ) ∪ ΓV . By Lemma 5.3.5, P(Q, S) is the PE of P(Q) w.r.t. {← ansθ~t}. Therefore, by Lemma 5.3.6, we have that: ansθ~t is true in P(Q, S) ⇐⇒ ansθ~t is true in P(Q).
But then, by Corollary 5.3.4, we obtain: ansθ~t is true in P(Q) ⇐⇒ ~t ∈ ans(Q, db(A(M, D))). By definition of the semantics of a union of conjunctive queries with inequalities, we have that ~t ∈ ans(Q, db(A(M, D))) if and only if there exists a query q̄ in Q such that ~t ∈ ans(q̄, db(A(M, D))). Thus, since q is a conjunctive query in Q, we obtain the claim.

Note that the correctness of RewDB is crucial, in that it allows completely forgetting the mappings, by compiling them directly into the queries to be posed over the underlying database. This proves the modularizability of the consistency and query answering services for DL-LiteA DIS. Specifically, we will see in the next section that Algorithm RewDB allows reasoning by exploiting, on the one hand, results on reasoning over DL-LiteA KBs and, on the other hand, the ability of the underlying database to answer arbitrarily complex queries.
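Before turning to the consistency and query answering services, the following minimal Python sketch (ours, not the thesis implementation) shows the shape of the RewDB pipeline under simplifying assumptions: resultants, Aux atoms, and Distinct atoms are plain tuples, and the compilation and SLD-derivation steps are taken as given callables.

def rew_db(Q, compile_program, sld_derive):
    """The three steps of RewDB: compile, SLD-derive, unfold each resultant."""
    C = compile_program(Q)            # C(Q): the program P(Q) without the Aux/Distinct facts
    S = sld_derive(C)                 # resultants (theta, body) from the non-failing leaves
    return [unfold_resultant(S, r) for r in S]

def unfold_resultant(S, resultant):
    """unfold_Pi: replace Aux_k atoms by the source query Phi_k and turn Distinct
    atoms over terms f1(y1), f2(y2) into disjunctions of inequalities.
    S is kept only to mirror the signature unfold_Pi(S, Q) used in the text."""
    theta, body = resultant
    source_queries, inequality_disjunctions = [], []
    for atom in body:
        if atom[0] == "Aux":                       # ("Aux", k, xs)
            _, k, xs = atom
            source_queries.append((k, xs))         # stands for Phi_k(xs), evaluated over D
        else:                                      # ("Distinct", (f1, ys1), (f2, ys2))
            _, (f1, ys1), (f2, ys2) = atom
            if f1 == f2:
                # same function symbol: the terms differ iff some component differs
                inequality_disjunctions.append(list(zip(ys1, ys2)))
            # different function symbols: always distinct, nothing to add
    return theta, source_queries, inequality_disjunctions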
5.3.2
Consistency algorithm
Let Π = hG, S, Mi be a DL-LiteA DIS and D a set of data sources conforming to S. In Fig. 5.2, we present an Algorithm Sat(Π, D) that, thanks to the use of the function RewDB, strongly resembles the Algorithm Sat(K) presented in the previous chapter (Fig. 4.3.2) to check the satisfiability of a DL-LiteA KB. More precisely, for each functionality assertion and each NI in the NI-Closure of G (denoted as usual cln(G)), Sat(Π, D) uses the functions ViolateFunct and ViolateNI, respectively defined in Fig. 4.3.2 and Fig. 4.3.2, which return a first-order query Q checking whether the minimal model of the virtual ABox generated from the mappings w.r.t. D violates any assertion of the global schema. Then, the algorithm uses the function RewDB(Q), which allows one to “forget” the mapping assertions, by returning the set of resultants S ′ as discussed in the previous section. After having further extracted from S ′ the union of queries Q′ , Sat(Π, D) evaluates Q′ over D and returns false if ans(Q′ , D) returns true. Otherwise, if no functionality or NI assertion “generates” a query returning true, then Sat(Π, D) returns true. As expected, we have the following result.

Theorem 5.3.10 Let Π = hG, S, Mi be a DL-LiteA DIS and D a set of sources conforming to S. Then, Sat(Π, D) terminates. Moreover, Π is consistent w.r.t. D if and only if Sat(Π, D) = true.
Proof. The termination of the Algorithm follows from the termination of RewDB. Concerning the soundness and the completeness of the algorithm, by Proposition 5.2.3, we have that Π is consistent w.r.t. D if and only if K = hG, db(A(M, D))i is satisfiable. Moreover, by Lemma 4.3.7, we have that K = hG, db(A(M, D))i is unsatisfiable if and only if Qdb(A(M,D)) = true for some Q such that Q = ViolateFunct(X) for some functionality assertion X ∈ G, or Q = ViolateNI(X) for some NI assertion X ∈ cln(G). Thus, in order to prove the theorem, it suffices to prove that:

(∗) Qdb(A(M,D)) = true if and only if ans(Q′ , D) = true,
Input: DL-LiteA DIS Π = hG, S, Mi, a set of sources D conforming to S
Output: true or false
(1) for each F = (funct P ) ∈ G do
      Q ← ViolateFunct(F );
      S ′ ← RewDB(Q);
      Q′ ← false;
      for each ansθ ← q ′ ∈ S ′ do Q′ ← Q′ ∪ {q ′ };
      if (ans(Q′ , D) = true) then return false
(2) for each N I = X1 ⊑ ¬X2 ∈ cln(G) do
      Q ← ViolateNI(N I);
      S ′ ← RewDB(Q);
      Q′ ← false;
      for each ansθ ← q ′ ∈ S ′ do Q′ ← Q′ ∪ {q ′ };
      if (ans(Q′ , D) = true) then return false
(3) return true

Figure 5.2: Algorithm Sat(Π, D)

for each Q described as above, where Q′ = ⋃_{(ansθ←q′) ∈ S ′} q ′ and S ′ = RewDB(Q). Clearly, this concludes the proof, since (∗) follows straightforwardly from the correctness of RewDB (cf. Theorem 5.3.9).
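Purely as an illustration of how Sat(Π, D) glues these pieces together (the attribute and function names below are assumptions, not the thesis API):

def sat(G, cln_G, D, violate_funct, violate_ni, rew_db, eval_union):
    """Return True iff no functionality or NI violation query has an answer over D."""
    violation_queries = (
        [violate_funct(F) for F in G.functionality_assertions] +
        [violate_ni(NI) for NI in cln_G.negative_inclusions]
    )
    for Q in violation_queries:
        resultants = rew_db(Q)                      # "forget" the mappings
        Q_union = [q_body for _theta, q_body in resultants]
        if eval_union(Q_union, D):                  # some violation is witnessed in D
            return False
    return True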
5.3.3
Query answering algorithm
Let Π = hG, S, Mi be a DL-LiteA DIS and D a set of data sources conforming to S. In Fig. 5.3, we present an Algorithm Answer(Q, Π, D) that, once again, is very similar to the Algorithm Answer(Q, K) presented in the previous chapter (Fig. 4.4.2) to answer queries posed over a DL-LiteA KB. Informally, the algorithm takes as input a DL-LiteA DIS, a set of data sources conforming to S, and a union of conjunctive queries Q over Π. Then, it proceeds exactly as in the case of DL-LiteA KBs (note that, analogously to the case of KBs, if Π is not consistent w.r.t. D, then ans(Q, Π, D) is the set of all possible tuples of object terms in Γ(ΓV , Λ) and constants in ΓV , denoted AllTup(Q, Π), whose arity is that of the query Q). Thus, it first computes the NI-Closure of G, and then it computes the perfect reformulation Qp of Q. At this point, Answer reformulates Qp by calling RewDB(Qp ) to compute the set of resultants S ′ . Then, for each resultant Q′ in S ′ , it extracts the conjunctive query in its body, evaluates it over D, and further processes the answers according to the substitution occurring in the head of Q′ . We next show the correctness of Algorithm Answer(Q, Π, D).
Input: UCQ Q, DL-LiteA DIS Π = hG, S, Mi, set of data sources D conforming to S
Output: the set of certain answers to Q over Π and D
G ← cln(G);
if Π is not consistent w.r.t. D then return AllTup(Q, Π)
else
  Qp ← ⋃_{qi ∈Q} PerfectRef(qi , G);
  S ′ ← RewDB(Qp );
  Rs ← ∅;
  for each ansθ ← q ′ ∈ S ′ do Rs ← Rs ∪ ans(q ′ , D)θ;
  return Rs ;
Figure 5.3: Algorithm Answer(Q, Π, D) Theorem 5.3.11 Let Π = hG, S, Mi be a DL-LiteA DIS, D a set of sources conforming to S, and Q a union of conjunctive queries over Π. Then, Answer(Q, Π, D) terminates. Moreover, let Rs be the set of tuples returned by Answer(Q, Π, D), and let ~t be a tuple of constants in Γ(ΓV , Λ). Then, ~t ∈ Q(Π , D) iff ~t ∈ Rs .
Proof. The termination of the algorithm follows from the termination of the Algorithm PerfectRef and of the function RewDB. Concerning the soundness and completeness of the Algorithm Answer, by Proposition 5.2.3, we have that sem(Π, D) = Mod(K), where K = hG, db(A(M, D))i. Moreover, given a union of conjunctive queries Q, by Lemma 4.4.5, we have that ans(Q, K) = (Qp )db(A(M,D)) , where Qp = PerfectRef(Q). Then, since by definition we have that:
❒ ans(Q, K) = {~t | ~tI ∈ QI , I ∈ Mod(K)}, and
❒ Q(Π, D) = {~t | ~tI ∈ QI , I ∈ sem(Π, D)},
it is easy to see that Q(Π, D) = (Qp )db(A(M,D)) . On the other hand, by construction, we have that Rs = {~t′ θ | ~t′ ∈ ans(q ′ , D), (ansθ ← q ′ ) ∈ S ′ }. Then, the claim clearly follows from the correctness of RewDB (cf. Theorem 5.3.9).
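The role of the head substitution θ in Answer can be illustrated by the following Python sketch (ours; the representation of θ as a list of slots is an assumption, not the thesis notation):

def answer(resultants, D, eval_query):
    """Collect certain answers: evaluate each unfolded body over the sources
    and apply the head substitution theta to every returned tuple."""
    Rs = set()
    for theta, q_body in resultants:           # each resultant is ans.theta <- q'
        for t_prime in eval_query(q_body, D):  # tuples of source values in Gamma_V
            Rs.add(apply_substitution(theta, t_prime))
    return Rs

def apply_substitution(theta, t_prime):
    """theta maps each answer position either to a plain value position ('val', i)
    or to an object term f applied to value positions ('obj', f, idxs)."""
    out = []
    for slot in theta:
        if slot[0] == "val":
            out.append(t_prime[slot[1]])
        else:
            _, f, idxs = slot
            out.append((f, tuple(t_prime[i] for i in idxs)))
    return tuple(out)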
5.3.4
Computational complexity
We first prove termination and complexity of RewDB. Lemma 5.3.12 Let Π = hG, S, Mi be a DL-LiteA DIS, D a set of sources conforming to S, and Q a union of conjunctive queries over Π. The function RewDB(Q, Π, D) runs in exponential time w.r.t. the size of Q, and in polynomial time w.r.t. the size of M.
Proof. Let Q be a union of conjunctive queries, and let n be the total number of atoms in the bodies of all q's in Q. Moreover, let m be the number of mappings and let m_b be the maximum size of the body of a mapping. The proof follows immediately from considering the cost of each of the three steps of RewDB(Q, Π, D):
1. The construction of C(Q) is clearly polynomial in n and m.
2. The computation of SLD-Derive(C(Q)) first builds a tree of depth at most n such that each of its nodes has at most m children, and then processes all the leaves of the tree to obtain the set S of resultants. By construction, this set has size O(m^n). Clearly, the overall computation has complexity O(m^n).
3. Finally, the application of the function unfold Π to each element in S has complexity O(m^n · m_b).
Based on the above property, we are able to establish the complexity of checking the consistency of a DL-LiteA DIS w.r.t. D and the complexity of answering unions of conjunctive queries over a DL-LiteA DIS w.r.t. D. Theorem 5.3.13 Given a DL-LiteA DIS Π and a set of data sources D, Sat(Π, D) is LOGSPACE in the size of D (data complexity). Moreover, it runs in polynomial time in the size of M, and in polynomial time in the size of G.
Proof. The proof of the claim is a consequence of the correctness of the algorithm Sat(Π, D), established in Theorem 5.3.10, and of the following facts:
1. the Algorithm Sat(Π, D) generates a number of queries Q over the minimal model of the virtual ABox that is polynomial in the size of G;
2. each query Q contains 2 atoms and thus, by Lemma 5.3.12, the application of RewDB to each Q is polynomial in the size of the mapping M and constant in the size of the data sources;
3. the evaluation of a union of conjunctive queries over a database can be computed in LOGSPACE with respect to the size of the database (since unions of conjunctive queries are a subclass of first-order logic queries).
Theorem 5.3.14 Given a DL-LiteA DIS Π and a set of data sources D, Answer(Q, Π, D) is LOGSPACE in the size of D (data complexity). Moreover, it runs in polynomial time in the size of M, in exponential time in the size of Q, and in polynomial time in the size of G.
Proof. The proof of the claim is a consequence of the correctness of the Algorithm Answer, established in Theorem 5.3.11, and the following facts:
1. the maximum number of atoms in the body of a conjunctive query generated by the Algorithm PerfectRef is equal to the length of the initial query Q;
2. the algorithm PerfectRef(Q, G) runs in time polynomial in the size of G;
3. by Lemma 5.3.12, the cost of applying RewDB to each conjunctive query in the union generated by PerfectRef is exponential in the size of the conjunctive query and polynomial in the size of M; this implies that the query to be evaluated over the data sources can be computed in time exponential in the size of Q, polynomial in the size of M, and constant in the size of D (data complexity);
4. the evaluation of a union of conjunctive queries over a database can be computed in LOGSPACE with respect to the size of the database (since unions of conjunctive queries are a subclass of first-order logic queries).
Chapter 6
Updates of Ontologies at the Instance Level In this chapter, we study the notion of update of an ontology expressed as a DL knowledge base. We recall that DL knowledge bases consist of a TBox used to express the intensional level of the ontology, i.e. general knowledge about concepts and their relationships, and an ABox used to express the extensional level of the ontology, i.e. the state of affairs regarding the instances of concepts. In the first section, we introduce a (restricted) variant of DL-LiteA , called DL-LiteF S , that we use for expressing KBs in this chapter. Then, we provide the general framework for instance level update of DL ontologies, by specifying in particular, the formal semantics for update. Afterwards, we address the issue of update in the context of DL-LiteF S : we show that DL-LiteF S is closed with respect to instance level update, in the sense that the result of an update is always expressible by a new DL-LiteF S ABox. This has to be contrasted with the results in [69], which imply that, if we use more expressive logics, instance level updates are generally not expressible in the logic of the original knowledge base. Finally, we provide an algorithm for computing updates in DL-LiteF S and we discuss its formal and computational properties.
6.1 The DL-LiteF S language In this chapter, we consider a restricted variant of DL-LiteA , called DL-LiteF S , which differs from DL-LiteA as follows: ❒ DL-LiteF S does not allow for specifying attributes; consequently, it does not allow specifying value-domains, concept attributes, role attributes, ranges nor domains; ❒ it does not allow for specifying inclusions among roles, nor negation of roles, thus only basic roles occur in the KB; and ❒ it allows general concepts to occur in membership assertions. The first two differences have been introduced essentially for clarity purposes. Indeed, we strongly conjecture that results of this chapter hold for DL-LiteA as well. 85
Concerning the occurrence of general concepts in membership assertions, as we will see, it is due to the need for a more expressive language to reflect an update. However, we point out that in Chapter 3 we showed that, given a DL-LiteF RS KB K with general expressions in the ABox, it is possible to build a new KB Conv(K), in PTIME in data complexity, that is equivalent to K from the point of view of query answering. Clearly, this holds also for DL-LiteF S , since it is a restricted variant of DL-LiteF RS . Next, we specify more precisely the syntax of a DL-LiteF S KB. Concepts in DL-LiteF S are defined as follows:

B ::= A | ∃Q        C ::= B | ¬B        Q ::= P | P −

where, as usual, A denotes an atomic concept, P an atomic role, B a basic concept, Q a basic role, and C a general concept. The universal assertions allowed in the TBox are of the form

B1 ⊑ B2 (inclusion assertion)        B1 ⊑ ¬B2 (disjointness assertion)

and of the form

(funct Q) (functionality assertion)

Finally, the membership assertions allowed in a DL-LiteF S ABox are of the form:

C(a),   Q(a, b),   C(z)   (membership assertions)
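Purely as an illustration (the class and field names are ours, not the thesis notation), the DL-LiteF S syntax above can be rendered as Python data types, which is convenient when prototyping the update machinery of the following sections.

from dataclasses import dataclass
from typing import Union

# Basic roles: an atomic role P or its inverse P^-
@dataclass(frozen=True)
class AtomicRole:
    name: str

@dataclass(frozen=True)
class InverseRole:
    of: AtomicRole

BasicRole = Union[AtomicRole, InverseRole]

# Basic concepts: an atomic concept A or an unqualified existential "exists Q"
@dataclass(frozen=True)
class AtomicConcept:
    name: str

@dataclass(frozen=True)
class Exists:
    role: BasicRole

BasicConcept = Union[AtomicConcept, Exists]

# General concepts: B or not B
@dataclass(frozen=True)
class Not:
    concept: BasicConcept

GeneralConcept = Union[BasicConcept, Not]

# TBox assertions: B1 ⊑ B2, B1 ⊑ ¬B2, (funct Q)
@dataclass(frozen=True)
class Inclusion:            # covers both inclusion and disjointness (rhs may be Not)
    lhs: BasicConcept
    rhs: GeneralConcept

@dataclass(frozen=True)
class Functionality:
    role: BasicRole

# ABox membership assertions: C(a), Q(a, b), C(z) with z a soft constant
@dataclass(frozen=True)
class ConceptAssertion:
    concept: GeneralConcept
    individual: str          # a constant or a soft constant

@dataclass(frozen=True)
class RoleAssertion:
    role: BasicRole
    subject: str
    object: str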
6.2 Instance-level ontology update
As already discussed in Section 1.4, several approaches to update have been considered in the literature. Here, we essentially follow Winslett's approach [87, 88]. The intuition behind such an approach is the following. There is an actual state of affairs of the world, of which, however, we have only an incomplete description. Such a description identifies a (typically infinite) set of models, each corresponding to a state of affairs that we consider possible. Among them, there is one model corresponding to the actual state of affairs, but we don't know which. Now, when we perform an update we are changing the actual state of affairs. However, since we don't really know which of our models corresponds to the actual state of affairs, we apply the change to every possible model, thus getting a new set of models representing the updated situation. Among them, we do have the model corresponding to the update applied to the actual state of affairs, but again we don't know which. As for how we update each model, only the changes that are absolutely required to accommodate what is explicitly asserted in the update are performed. Observe that this intuition is essentially the one behind most of the research on reasoning about actions. For example, this vision is completely shared by Reiter's variant of the Situation Calculus [83]. See in particular [84], where possible worlds are considered explicitly and actions on such worlds correspond to the above description.¹
¹
Actually [84] studies also “knowledge producing actions” (i.e., sensing actions), which are more related with belief revision than update.
Winslett's approach to update is also completely compatible with the proposal in [69, 17], where updates of a DL ABox without a TBox are studied for several expressive DLs. Observe that in those works, since the TBox is not present², the intensional level of the ontology is not specified, and updates are only relative to the instance level represented by the ABox. Here, instead, we do consider the intensional level of the ontology, represented by a TBox, although, as we said above, we insist that such a level is unchanging. Updates impact only the instance level of the ontology, according to what is specified in the update, in a way consistent with keeping the universal assertions in the intensional level. Before addressing in the next section updates over a DL-LiteF S KB, we next define the general framework for instance-level update of a DL ontology, provide preliminary definitions, and specify formally the crucial notions of model update and update.

Definition 6.2.1 (Containment between interpretations) Let I = (∆^I, ·^I) and I′ = (∆^{I′}, ·^{I′}) be two interpretations (over the same alphabet). We say that I is contained in I′, written I ⊆ I′, iff I, I′ are such that:
❒ if a ∈ A^I then a ∈ A^{I′}, for every a ∈ ∆ and atomic concept A;
❒ if (a, b) ∈ R^I then (a, b) ∈ R^{I′}, for every (a, b) ∈ ∆ × ∆ and atomic role R.
We say that I is properly contained in I′, written I ⊂ I′, iff I ⊆ I′ but I′ ⊈ I.

Definition 6.2.2 (Difference between interpretations) Let I = (∆^I, ·^I) and I′ = (∆^{I′}, ·^{I′}) be two interpretations (over the same alphabet). We define the difference between I and I′, written I ⊖ I′, as the interpretation (∆^I, ·^{I⊖I′}) such that:
❒ A^{I⊖I′} = A^I ⊖ A^{I′}, for every atomic concept A;
❒ P^{I⊖I′} = P^I ⊖ P^{I′}, for every atomic role P;
where S ⊖ S′ denotes the symmetric difference between sets S and S′, i.e. S ⊖ S′ = (S ∪ S′) \ (S ∩ S′).

Definition 6.2.3 (Model update) Let T be a TBox in a DL L, I a model of T, and F a finite set of membership assertions expressed in L such that Mod(T) ∩ Mod(F) ≠ ∅. We define the (result of the) update of I with F, denoted by U^T(I, F), as follows:

U^T(I, F) = {I′ | I′ ∈ Mod(T) ∩ Mod(F) and there exists no I′′ ∈ Mod(T) ∩ Mod(F) s.t. I ⊖ I′′ ⊂ I ⊖ I′}

² Or, if present, it is assumed to be acyclic. Acyclic TBoxes cannot always be used to model the intensional level of an ontology, since the abbreviations that they introduce can be eliminated without semantic loss. Naturally, since acyclic TBoxes may provide compact representations of complex concepts, they may have an impact on the computational complexity of reasoning.
Observe that U^T(I, F) is the set of models of T and F whose difference w.r.t. I is ⊆-minimal, and that such a set is non-empty.

Definition 6.2.4 (Update) Let T be a TBox expressed in a DL L, M ⊆ Mod(T) a set of models of T, and F a finite set of membership assertions expressed in L such that Mod(T) ∩ Mod(F) ≠ ∅. We define the (result of the) update of M with F, denoted as M ◦T F, as the following set of models: M ◦T F = ⋃_{I∈M} U^T(I, F).

Let K = hT , Ai be a knowledge base in L and F a finite set of membership assertions expressed in L such that Mod(T) ∩ Mod(F) ≠ ∅. With a little abuse of notation and terminology, we will write K ◦T F to denote Mod(K) ◦T F, and talk about the update of K instead of talking about the update of the models Mod(K) of K. A basic question arises from such definitions. Is the result of updating a knowledge base still expressible as a new knowledge base in the same DL?³ Let us introduce the following definition.

Definition 6.2.5 (Expressible update) Let K = hT , Ai be a knowledge base expressed in a DL L and F a set of membership assertions expressed in L such that Mod(T) ∩ Mod(F) ≠ ∅. We say that the update of K with F is expressible in L iff there exists an ABox A′ expressed in L such that K ◦T F = Mod(hT , A′ i).

The results in [69] show that, for several quite standard DLs, updates are not expressible in the original language of the knowledge base, even if TBoxes are not considered. Instead, in the case of DL-LiteF S we have the notable property that updates are always expressible in DL-LiteF S itself, as we show in the next section.
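As a purely illustrative aid (not part of the formal development), the ⊖-minimality underlying Definitions 6.2.3 and 6.2.4 can be prototyped for finite interpretations represented as sets of ground atoms; the helper names and the atoms in the usage example below are our own.

def difference(I, I2):
    """I ⊖ I2 for finite interpretations given as sets of ground atoms
    such as ('A', 'a') or ('P', 'a', 'b'): the symmetric difference."""
    return (I | I2) - (I & I2)

def model_update(I, candidates):
    """U^T(I, F) restricted to a finite pool of candidates, each assumed to
    already be a model of both T and F: keep those whose difference with I
    is ⊆-minimal."""
    minimal = []
    for J in candidates:
        d_J = difference(I, J)
        if not any(difference(I, K) < d_J for K in candidates):  # strict subset test
            minimal.append(J)
    return minimal

# Tiny usage example with hypothetical atoms:
I = {('manager', 'Lenz'), ('employee', 'Lenz')}
J1 = {('employee', 'Lenz')}          # drop manager(Lenz) only
J2 = set()                           # drop everything
print(model_update(I, [J1, J2]))     # keeps only J1, the minimally changed model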
6.3 Computing updates in DL-LiteF S ontologies
In this section, we address instance-level updates in DL-LiteF S as specified in the previous section. In particular:
❒ we show that the result of an update is always expressible within DL-LiteF S , i.e., there always exists a new DL-LiteF S ABox that reflects the changes of the update to the original knowledge base (obviously, the TBox remains unchanged, as required);
❒ we show that the new ABox resulting from an update can be automatically computed;
❒ finally, we show that the size of such an ABox is polynomially bounded by the size of the original knowledge base, and moreover that it can be computed in polynomial time.
Before starting the technical development, we illustrate the update on an example to gain some intuition on the problem.
³
Note that this question corresponds to the “expressible update problem” presented in Chapter 1.4 for DIS.
Example 6.3.1 Consider the ontology presented in Example 3.1.5. Now suppose that Lenz is not a manager anymore: we update the ontology with the membership assertion ¬manager(Lenz). Based on the semantics presented above, the result of the update can be expressed by the following ABox: {¬manager(Lenz), employee(Lenz)}. Note that the new instance level reflects that Lenz is an employee who is not a manager. Interestingly, the fact that Lenz is not a manager implies that he does not manage anything anymore. Nevertheless, he remains an employee, and he still works for the project he used to manage, and this would not be captured by simply removing the ABox assertions that are inconsistent with the update.

In Fig. 6.1, we provide an algorithm to perform an update over a DL-LiteF S knowledge base. To simplify the presentation we make use of the following notation. First, we denote by Q− the inverse of Q, i.e., if Q is an atomic role, then Q− is its inverse, while if Q is the inverse of an atomic role, then Q− is that atomic role itself. Second, we write ¬C to denote ¬B if C is B, and B if C is ¬B. Also, we use the notation C1 ⊑ C2 to denote assertions of the form B1 ⊑ B2 , B1 ⊑ ¬B2 , or ¬B1 ⊑ ¬B2 . Finally, we denote by cl (T ) the deductive closure of T , which can be defined as the obvious generalization of cln(T ) presented in Section 4.2.3, i.e. cl (T ) is built from both positive and negative inclusions. Clearly, by following the same line of reasoning as for cln(T ), it can be shown that in DL-LiteF S , cl (T ) can be computed in polynomial time w.r.t. T .

The algorithm in Fig. 6.1 takes as input a satisfiable DL-LiteF S knowledge base K = hT , Ai and a finite set of ground (i.e., not involving soft constants) membership assertions F, and returns either ERROR (if hT , Fi is unsatisfiable) or an ABox A′ (otherwise). Roughly speaking, the algorithm proceeds as follows. After a satisfiability check, it inserts into A′ all the membership assertions in A and F (lines 3–4), and then uses the Algorithm PerfectRef, presented in Section 4.4, Fig. 4.4.2, to compute the set F ′ of membership assertions that are logically implied by K and contradict F according to T (lines 5–18)⁴. Finally, for each F ′ ∈ F ′ , the algorithm deletes F ′ from A′ , but inserts into A′ those membership assertions that are logically implied by the deleted membership assertions and do not contradict F (lines 19–32).

Lemma 6.3.2 Let K = hT , Ai be a satisfiable DL-LiteF S knowledge base, F a finite set of ground DL-LiteF S membership assertions such that Mod(T ) ∩ Mod(F) ≠ ∅, and K′ the DL-LiteF S knowledge base such that K′ = hT , A′ i, where A′ = ComputeUpdate(T , A, F). We have that K′ is always satisfiable.
Proof. By construction of the Algorithm, K′ is obtained from K by: ❒ inserting into A′ : 1. a finite set F of ground membership assertions; these are by hypothesis such that M od(T ) ∩ M od(F) 6= ∅; 4
Note that the Algorithm PerfectRef, as introduced in Fig. 4.4.2, returns a union of conjunctive queries. Clearly, since here we use it by giving it a ground term as input, then it returns a set of ground atoms, i.e. a set of ground membership assertions
INPUT: finite set of ground membership assertions F, satisfiable DL-LiteF S -KB hT , Ai OUTPUT: an ABox A′ , or ERROR [1] if hT , Fi is not satisfiable then ERROR [2] else for each F ∈ F do [3] if F = Q(a, b) then F := F ∪ {∃Q(a), ∃Q− (b)} [4] A′ := A ∪ F; F ′ := ∅ [5] for each F1 ∈ F do [6] if F1 = C(a) then [7] for each F ′ ∈ PerfectRef(¬C(a), T ) do [8] if F ′ = C ′ (a) and hT , Ai |= C ′ (a) then [9] F ′ := F ′ ∪ {C ′ (a)} [10] else if F ′ = ∃Q′ (a) then [11] F ′ := F ′ ∪ {Q′ (a, b) | Q′ (a, b) ∈ A} [12] else if F1 = Q(a, b) then [13] if (funct Q) in T then [14] for each b′ 6= b s.t. hT , Ai |= Q(a, b′ ) do [15] F ′ := F ′ ∪ {Q(a, b′ )} [16] if (funct Q− ) in T then [17] for each a′ 6= a s.t. hT , Ai |= Q(a′ , b) do [18] F ′ := F ′ ∪ {Q(a′ , b)} ′ [19] for each F ∈ F ′ do [20] if F ′ = C ′ (a) then [21] A′ := A′ \ {C ′ (a)} [22] for each C ′ ⊑ C1 in cl (T ) do [23] if (C1 (a) ∈ / F ′ ) then A′ := A′ ∪ {C1 (a)} ′ [24] if F = ∃Q(a) then [25] for each ∃Q− ⊑ C2 in cl (T ) do [26] A′ := A′ ∪ {C2 (z)}, with z new soft constant in V [27] else if F ′ = Q(a, b) then [28] A′ := A′ \ {Q(a, b), ∃Q(a), ∃Q− (b)} [29] for each ∃Q ⊑ C3 in cl (T ) do [30] if C3 (a) ∈ / F ′ then A′ := A′ ∪ {C3 (a)} [31] for each ∃Q− ⊑ C4 in cl (T ) do [32] if C4 (b) ∈ / F ′ then A′ := A′ ∪ {C4 (b)}
Figure 6.1: Algorithm ComputeUpdate(T , A, F)
2. a finite set of membership assertions F ′′ that do not contradict F and are logically implied by K (such membership assertions are introduced into A′ either at line 24, or 28, or 29, or 33, or 34, or 39, or 42); these are therefore such that M od(T ) ∩ M od(F ′′ ) ∩ M od(F) 6= ∅; ′ } ❒ deleting from A′ the maximum finite set of membership assertions F ′ = {F1′ , ..., Fm that contradict F; these are therefore such that M od(T )∩M od(F ′ )∩M od(Fi′ ) = ∅, for each i such that Fi′ ∈ F and there exists no F ′ ∈ A \ F ′ such that M od(T ) ∩ M od(F ′ ) ∩ M od(F) 6= ∅.
Therefore we have that A′ = (A ∪ F ∪ F ′′ ) \ F ′ . Then, since by hypothesis K is satisfiable, i.e. M od(T ) ∩ M od(A) 6= ∅, we have that: M od(K′ ) = M od(T ) ∩ M od(A′ ) is satisfiable. Next, we deal with termination, soundness and completeness of the algorithm shown in Fig. 6.1. Lemma 6.3.3 (Termination) Let K = hT , Ai be a DL-LiteF S knowledge base, F a finite set of ground DL-LiteF S membership assertions. Then the algorithm ComputeUpdate(T , A, F) terminates, returning ERROR if M od(T ) ∩ M od(F) = ∅, and an ABox A′ such that hT , A′ i is a DL-LiteF S knowledge base, otherwise.
Proof. The termination of ComputeUpdate(T , A, F) follows directly from the termination of Algorithm PerfectRef. Next, we prove that the algorithm shown in Fig. 6.1 is sound and complete. Lemma 6.3.4 (Soundness) Let K = hT , Ai be a DL-LiteF S knowledge base, F a finite set of ground DL-LiteF S membership assertions such that M od(T )∩M od(F) 6= ∅, and K′ the DL-LiteF S knowledge base such that K′ = hT , A′ i, where A′ = ComputeUpdate(T , A, F). Then, for every model I ′ ∈ M od(K′ ), we have that: ∃I ∈ M od(K) s.t. I ′ ∈ U T (I, F).
Proof. Let A′ = ComputeUpdate(T , A, F), and let I ′ be a model of K′ = hT , A′ i. We show how to build an interpretation I that is a model of K. In particular, we start from I ′ and modify it in order to obtain an interpretation I that satisfies K. Then we prove that I ′ ∈ U T (I, F), i.e. I ′ is a model of T and F that is at the minimal distance from I. Suppose first that I ′ is a model of K. Then, the theorem trivially holds by taking I = I ′ . Suppose now that I ′ is not a model of K. Since I ′ is by hypothesis a model of T , this means that I ′ does not satisfy a set of membership assertions F ′ = {Fi | i = 1, · · · , n} ⊆ A. Then, by construction, Fi has been deleted from A, for i = 1, · · · , n. Let us now modify I ′ in order to make it satisfy each Fi in F ′ . Starting by considering i = 1, we repeatedly apply the function ModelSat to Ii , where I0 = I ′ , In = I and Ii is the interpretation that is returned by calling ModelSat(Ii−1 , Fi ) . Intuitively, ModelSat(Ii−1 , Fi ) modifies Ii−1 by changing only the interpretations of constants in Γ that contradict the satisfaction of Fi . More precisely, the computation of ModelSat(Ii−1 , Fi ) proceeds as follows. 1. First we set Ii = Ii−1 .
CHAPTER 6. UPDATES OF ONTOLOGIES AT THE INSTANCE LEVEL 2. Second, we apply the following base rules. (a) If Fi = C(a) then we set aIi ∈ C Ii . (b) If Fi has the form Fi = Q(a, b), we set (aIi , bIi ) ∈ QIi , aIi ∈ ∃QIi and I bIi ∈ ∃Q− i . Moreover, if (funct Q) ∈ T , then for each (aIi−1 , b′Ii−1 ) ∈ / QIi , and if there exists no QIi−1 such that b′ 6= b we set (aIi , b′Ii ) ∈ I / ∃Q− i . a′′ 6= a such that (a′′Ii−1 , b′Ii−1 ) ∈ QIi−1 then we set b′Ii ∈ Respectively, if (funct Q− ) ∈ T , then for each (a′Ii−1 , bIi−1 ) ∈ QIi−1 such that a′ 6= a, we set (a′Ii , bIi ) ∈ QIi and if there exists no b′′ 6= b / ∃QIi . such that (a′Ii−1 , b′′Ii−1 ) ∈ QIi−1 then we set a′Ii ∈ 3. Third, we apply recursively the following rules. (a) If a ∈ B ′Ii , B ′ ⊑ C ∈ T and a ∈ / C Ii−1 , then set a ∈ C Ii . Note that this operation modifies I only if a ∈ B ′ Ii has been set in a previous step (otherwise, since Ii−1 is a model of T , if a ∈ B ′ Ii−1 then a ∈ C Ii−1 ). (b) If a ∈ ∃Q′ Ii and there exists no individual b ∈ ∆Ii−1 such that (a, b) ∈ Q′ Ii−1 , then add (a, b′ ) ∈ Q′ Ii and b′ ∈ ∃Q′ Ii , where b′ is an element of ∆Ii such that if (funct Q) ∈ T then there exists no a′′ s.t. (a′′ , b′ ) ∈ QIi−1 . Note that one such b′ always exists since otherwise, F ′ ⊆ A is not satisfiable, which both is not possible by hypothesis. / Q′ Ii , and if (c) If a ∈ ¬∃Q′ Ii , then for each (a, b) ∈ Q′Ii−1 we set (a, b) ∈ I / ∃Q− i . there exists no a′′ 6= a such that (a′′ , b) ∈ QIi−1 then we set b ∈
Clearly, by construction, I defined as above is a model of T . Also, I satisfies F ′ which is by hypothesis the set of membership assertions in A that are not in A′ . Moreover, I still satisfies all other membership assertions in A. In fact, suppose by contradiction that there exists F ′ ∈ A that is not satisfied by I. This means that by construction, in order to satisfy F ′ , I ′ needs to be modified so that F ′ is not satisfied anymore. But then, this means that F ′ logically implies ¬F ′ , which is not possible since F ′ and F ′ belong to A and K is by hypothesis satisfiable. Therefore I satisfies all membership assertions in A, which proves that I is a model of K. Now, in order to complete the proof, we need to show that I ′ ∈ U T (I, F). By hypothesis, I ′ is a model of T . Moreover, since F ⊆ A′ , then I ′ is a model of F. Let us now show that there exists no interpretation I ′′ 6= I ′ of T and F such that: ❒ I ′′ ∈ M od(T ∪ F), and ❒ I ⊖ I ′′ ⊂ I ⊖ I ′ . Suppose by contradiction that such an interpretation I ′′ exists. Then one of the following cases occurs: ′
′′
/ AI ; 1. either there exists a such that a ∈ AI , a ∈ AI and a ∈ ′′
′
2. or there exists a such that a ∈ / AI , a ∈ / AI and a ∈ AI ; ′′
′
3. or there exists (a, b) such that (a, b) ∈ QI , (a, b) ∈ QI and (a, b) ∈ / QI ;
6.3. COMPUTING UPDATES IN DL-LITEF S ONTOLOGIES
93 ′
′′
4. or there exists (a, b) such that (a, b) ∈ / QI , (a, b) ∈ / QI and (a, b) ∈ QI ; where A and Q denote resp. an atomic concept and a role. Let us consider one by one all the above possible cases, starting from the first. Since I has been obtained from I ′ by applying the function ModelSat as specified above, then it means that one of the following cases occurs. ❒ Either there exists F ′ ∈ F ′ such that F ′ = A(a), where F ′ ∈ A; in this case, since F ′ contradicts F, we have that A(a) ∈ PerfectRef(¬C(a)) for some ′′ ′′ ′′ ′′ C(a) ∈ F. Therefore, aI ∈ AI would imply that aI ∈ / C I , which would contradict that I ′′ is a model of F. ❒ Or a ∈ B ′I , because of the application of the function ModelSat. But then, ′ B ′ ⊑ A ∈ T and a ∈ / AI . This means that I was previously modified in order to satisfy an assertion in F ′ and it was necessary to have a ∈ B ′ I , in order to ′′ satisfy T . Therefore, again, we obtain a contradiction since either a ∈ / B′I ′′ and I ′′ is not a model of F ′ , or a ∈ B ′ I and I ′′ is not a model of T . ′′
′
Let us now consider the second case. If a ∈ / AI , a ∈ / AI and a ∈ AI , then we ′′ ′ have that a ∈ ¬AI , a ∈ ¬AI and a ∈ / ¬AI . Therefore, we can reduce this case to the previous one, and prove that we would obtain similarly a contradiction. Let us now suppose that I ′′ is such that there exists (a, b) such that (a, b) ∈ ′ ′′ I / QI . Then, since I has been obtained from I ′ by Q , (a, b) ∈ QI and (a, b) ∈ applying the function ModelSat, it means that I has been modified because one of the following cases occurs. ❒ Either I ′ does not satisfy an assertion F ′ = Q(a, b) ∈ F ′ , where F ′ ⊆ A. In this case, since F ′ contradicts F, we have that either (i) F ′ contradicts F because of a functionality assertion, or (ii) F ′ comes from the perfect reformulation of ¬C(a) for some C(a) ∈ F, which means that Q(a, b) logically implies ¬C(a). Suppose first that F ′ contradicts F because of a functionality assertion, e.g. (funct Q) ∈ T for some Q(a, b′ ) ∈ F, b 6= b′ . Then ′′ ′′ / QI which would contradict that I ′′ (a, b) ∈ QI would imply that (a, b′ ) ∈ is a model of F. Similarly, we would obtain a contradiction by supposing that (funct Q− ) ∈ T for some Q(a′ , b) ∈ F, a′ 6= a. Suppose now that F ′ contradicts F because it logically implies the negation of some assertion in F, e.g. ′′ ′′ / C I , which would contradict C(a). Then, (a, b) ∈ QI would imply that a ∈ that I ′′ is a model of F. ′
❒ Or, I is such that a ∈ ∃QI and there exists no b′ such that (a, b′ ) ∈ QI . But then, this means that I was previously modified to make I satisfy F and T . In particular, this means that ∃Q(a) is logically implied to an assertion F ′ ′′ contradicting F. Therefore, again, if Q(a, b′′ ) ∈ QI , for some b′′ , then we ′′ obtain a contradiction since we would have a ∈ ∃QI , which would would imply that I ′′ is not a model of F. Let us now consider the latter case, i.e. the case of (a, b) such that (a, b) ∈ / QI , ′ ′′ I I (a, b) ∈ / Q and (a, b) ∈ Q . By inspecting the function ModelSat we easily note ′ that the only cases when the interpretation of (a, b) is modified so that (a, b) ∈ QI and (a, b) ∈ / QI are the following.
94
CHAPTER 6. UPDATES OF ONTOLOGIES AT THE INSTANCE LEVEL ❒ Either F ′ = Q(a, b′ ) ∈ F ′ contradicts F ∈ F for some b′ 6= b where F = Q(a, b) and either (i) (f unctQ) ∈ T . By setting (a, b′ ) ∈ QI we must ′′ ′ / QI consequently set (a, b) ∈ / QI whereas (a, b) ∈ QI . But then, if (a, b) ∈ we obtain a contradiction since I ′′ is not a model of F. ❒ Or F ′ = Q(a′ , b) ∈ F ′ contradicts F ∈ F for some a′ 6= a because F = Q(a, b) and (f unctQ− ) ∈ T . This case is analogous to the previous one. ′
❒ Or (a, b) ∈ QI and a ∈ ¬∃QI . Suppose that a ∈ ¬∃QI and a ∈ ∃QI . This means that I was previously modified to satisfy F ′ and it was necessary to have a ∈ ¬∃QI , in order to satisfy T . Therefore, again, we obtain a contradiction ′′ ′′ since either a ∈ ∃QI and I ′′ is not a model of F ′ , or a ∈ / ∃QI and I ′′ is not a model of T . Similarly, we would obtain a contradiction by supposing that I I′ b ∈ ¬∃Q− and b ∈ ∃Q− . I
❒ Or (a, b) ∈ QI and b ∈ ¬∃Q− . This case is analogous to the previous one. Therefore, assuming that there exists an interpretation I ′′ such that I ′′ ∈ M od(T ∪ F), and I ⊖ I ′′ ⊂ I ⊖ I ′ leads to a contradiction, which proves that I ′ ∈ U T (I, F).
Lemma 6.3.5 (Completeness) Let K = hT , Ai be a DL-LiteF S knowledge base, F a finite set of ground DL-LiteF S membership assertions such that M od(T ) ∩ M od(F) 6= ∅, and K′ the DL-LiteF S knowledge base such that K′ = hT , A′ i, where A′ = ComputeUpdate(T , A, F). Then, for every model I ∈ M od(K), we have that: U T (I, F) ⊆ M od(K′ ).
Proof. To prove the theorem we proceed by assuming by contradiction that there exists an interpretation I ′′ ∈ U T (I, F) that is not a model of K′ . Then I ′′ does not satisfy at least one membership assertion F ′ in A′ \ F. We can suppose without loss of generality that there exists only one such assertion. By construction A′ contains all the assertions of A that do not contradict F and other assertions (introduced into A′ either at line 24, or 28, or 29, or 33, or 34, or 39, or 42 of the Algorithm ComputeUpdate) that are logically implied in K and do not contradict F. Then, suppose that we modify I ′′ in order to make it satisfy F ′ . We must consequently modify I ′′ to make it still satisfy T . This can be done by applying to I ′′ the function ModelReach(I ′′ , I, F ′ ). This function is similar to ModelSat (cf. proof of the Algorithm soundness) in that it basically modifies I ′′ by forcing it to satisfy F ′ and T . However, since here we aim at building a model of F ′ that is closer to I than I ′′ , ModelReach(I ′ , I, F ′ ) proceeds by performing possible choices as in I. Note that this is always possible since I is a model of F ′ . More precisely, the computation of ModelReach(I ′ , I, F ′ ) returns an interpretation I¯ by proceeding as follows. 1. First, we set I¯ = I ′ . 2. Second we modify I¯ in order to make it satisfy F ′ as follows. ¯
(a) If F ′ = C(a), then we set a ∈ C I .
(b) If F ′ = C(z), then we find one constant b ∈ C I and we set b ∈ C I . Note that such a constant must exist. in fact, F ′ is inserted into A′ either at line 29 or at line 34. Suppose that it is inserted at line 29. Then, by hypothesis, we have that (i) a ∈ ∃QI , which implies that there exists b s.t. I (a, b) ∈ QI and b ∈ ∃Q− , and (ii) ∃Q− ⊑ B ∈ T , which implies that b ∈ B I . Note that the case in which F ′ is inserted at line 34 is analogous. ¯
¯
I¯
(c) If F ′ = Q(a, b), then we set (a, b) ∈ QI , a ∈ ∃QI and b ∈ ∃Q− . 3. Third, we apply recursively the following rule in order to make I¯ satisfy T . ¯
¯
(a) If a ∈ B ′ I , B ′ ⊑ C and a ∈ / C I , then a ∈ C I . Note that this operation ¯ modifies I¯ only if a ∈ B ′ I has been in a previous step (otherwise since ′′ I ′′ is a model of T , then a ∈ C I ). ′′
¯
I¯
(b) If a ∈ ∃Q′ I (resp. a ∈ ∃Q′ − ) and there exists no individual b ∈ Γ ′′ ′′ such that (a, b) ∈ Q′ I (resp. (b, a) ∈ Q′ I ), then for each (a, b′ ) ∈ Q′ I ¯ ¯ (resp. (b′ , a) ∈ Q′ I ), set (a, b′ ) ∈ Q′ I (resp. (b′ , a) ∈ Q′ I ). Note again ¯ that this operation modifies I¯ only if a ∈ ∃Q′ I has been set in a previous step. Moreover, in this case, there always exists at least one b′ such that (a, b′ ) ∈ Q′ I (resp. (b′ , a) ∈ Q′ I ) since I is a model of F ′ and I ′′ is modified in order to satisfy F ′ and everything that is logically implied by F ′. Clearly, by construction, the interpretation I¯ obtained as above is a model of T . Moreover, I¯ satisfies F ′ which is by hypothesis the only membership assertion of A′ that is not satisfied by F ′ . Moreover, I¯ still satisfies all other membership assertions in A′ . In fact, suppose by contradiction that there exists F ′′ ∈ A′ that is not satisfied ¯ This means that, by construction, in order to satisfy F ′ , I ′′ needs to be modified by I. so that F ′′ is not satisfied anymore. But then, this means that F ′ logically implies ¬F ′′ , which is not possible since F ′ and F ′′ belongs to A′ and K′ by Lemma 6.3.2 we know that K′ is satisfiable. Therefore I¯ satisfies all membership assertions in A′ , which proves that I¯ is a model of K′ . Finally, I ⊖ I¯ ⊂ I ⊖ I ′′ since by construction I¯ is obtained by modifying I ′′ so that I¯ interprets a set of objects as I (whereas I ′′ ¯ does not), and nothing that is interpreted in I ′′ as in I is interpreted differently in I. ′′ T ′′ ′ Therefore, by assuming that I ∈ U (I, F) and that I is not a model of K , we obtain that it is possible to build a model I¯ that is closer to I than I ′′ , which is a contradiction. From the two lemmas above, we get the following theorem, that sanctions the correctness of our algorithm. Theorem 6.3.6 Let K = hT , Ai be a DL-LiteF S knowledge base F a finite set of ground DL-LiteF S membership assertions such that M od(T ) ∩ M od(F) 6= ∅, and A′ = ComputeUpdate(T , A, F). Then K ◦T F ≡ M od(hT , A′ i). Interestingly, if we do not allow for DL-LiteF S membership assertions involving soft constants in the ABox, then we lose expressibility, as shown by the following example.
Example 6.3.7 The TBox {∃P − ⊑ A1 , A2 ⊑ ¬∃P } and the ABox {∃P (a)} imply that there exists an object that is both a P -successor of a, and an instance of A1 . Now let us consider the update {A2 (a)}. As a result of the update, A2 (a) must be logically implied, hence we must remove ∃P (a) from the ABox, but the fact that there is an instance of A1 must remain logically implied after the update. It can be easily seen that to express this in the new ABox we must use A1 (z) where z is a new soft constant. 2 Note that a similar observation holds for membership assertions involving general concepts. Next we turn to the computational complexity of computing the update. By analyzing the algorithm we get: Theorem 6.3.8 Let K = hT , Ai be a DL-LiteF S knowledge base, F a finite set of ground DL-LiteF S membership assertions such that M od(T ) ∩ M od(F) 6= ∅, and A′ = ComputeUpdate(T , A, F). Then: ❒ the size of A′ is polynomially bounded by the size of T ∪ A ∪ F; ❒ computing A′ can be done in polynomial time in the size of T ∪ A ∪ F.
Proof. The proof of this theorem is an immediate consequence of the following observations: ❒ there is one call PerfectRef(A, T ) for each atom A in A; ❒ PerfectRef(q, T ) runs in polynomial time in the size of T , and in exponential time in the size of q; thus, in this case, the call PerfectRef(A, T ) has cost polynomial in T ; moreover, it produces a set of facts whose size is polynomial in the size of T ; ❒ for each F ∈ PerfectRef(A, T ), the check K |= F is L OG S PACE in A; ❒ for each F ′ ∈ F ′ the cost of eliminating F ′ from A′ is clearly polynomial in the size of A.
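To summarize the machinery of this chapter in executable form, the following is a rough Python skeleton of ComputeUpdate under strong simplifying assumptions: assertions are hashable objects, and all reasoning tasks (satisfiability, the PerfectRef-based detection of contradicted assertions, and the consequences drawn from cl(T)) are delegated to caller-supplied functions. It mirrors the structure of Figure 6.1 but is not the thesis implementation.

def compute_update(T, A, F, is_satisfiable, contradicted_by, consequences_of):
    """Rough skeleton of ComputeUpdate(T, A, F); the reasoning is delegated to
    the callables, which stand for the PerfectRef/entailment/cl(T) machinery."""
    if not is_satisfiable(T, F):
        raise ValueError("ERROR: the update F is inconsistent with the TBox")
    A_new = set(A) | set(F)                     # start from A plus the update
    to_delete = set()                           # the set F' of the algorithm
    for f in F:                                 # implied assertions of <T, A>
        to_delete |= contradicted_by(f, T, A)   # that contradict the update
    for g in to_delete:                         # delete g, but keep those of its
        A_new.discard(g)                        # consequences that do not
        for implied in consequences_of(g, T):   # contradict the update
            if implied not in to_delete:
                A_new.add(implied)
    return A_new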
Part III
XML-based DIS
As we already discussed in Chapter 2, several data integration systems and theoretical works have been proposed for relational data, whereas not much investigation has focused yet on XML-based data integration, besides few exceptions (cf. Chapter 2, Fig. 2.1). Our goal in this part of the thesis is to address some of its issues. In particular, we highlight two major issues that emerge in the XML context: (i) the global schema may be characterized by a set of constraints, expressed by means of a DTD and XML integrity constraints, (ii) the concept of node identity requires to introduce semantic criteria to identify nodes coming from different sources. The latter is similar to the problem of identifying objects in mediators systems [78]. Given the importance of this issue for information integration, much work has recently been focusing in identifying records representing the same ”real-world entity” and reconciling them to obtain one record per entity (the so-called Entity Resolution [19], or Reference Reconciliation [38] problems). As we shall see, this problem requires some particular solution in the context of XML data integration. Let us first illustrate by an example XML-based data integration issues. Suppose that a hospital offers access to information about patients and their treatments. Information is stored in XML documents managed in different services of the hospital. However, because of privacy and security reasons, each user sees only parts of the data depending on her access rights. For instance, statisticians have access to the global schema SG having the form of the following DTD:
SG :
<!ELEMENT hospital (patient+, treatment+)>
<!ELEMENT patient (SSN, name, cure*, bill*)>
<!ELEMENT treatment (trID, procedure?)>
<!ELEMENT procedure (treatment+)>
To simplify, and following a common approach for XML data, we consider XML documents as unordered trees, with nodes labeled with elements names. The above DTD says that the document contains data about patients and hospital treatments, where a cure is nothing but a treatment id. Moreover, a set of keys and foreign key constraints are specified over the global schema. In particular, we know that two patients cannot have the same social security number SSN, that two treatments cannot have the same number trID and that all the prescribed cures have to appear among the treatments of the hospital. Such constraints correspond respectively to two key constraints and one foreign key constraint. Finally, assume that the sources consist of the following two documents, D1 and D2 , with the following DTDs. 99
[Source documents: D1 , conforming to DTD S1 , lists patients Parker (SSN 55577) and Rossi (SSN 20903); D2 , conforming to DTD S2 , lists a patient with SSN 55577 together with a bill and a prescribed cure.]
By means of mappings, we specify that D1 contains patients with a name and a social security number lower than 100000, and D2 contains patients that paid a bill and were prescribed at least one dangerous cure (we assume that these have numbers smaller than 35). Moreover, we specify that these mappings are sound, which means that D1 and D2 contain resp. a subset of all patients having a name and a social security number lower than 100000, and a subset of all patients having paid a bill and been prescribed a dangerous cure. Note that if we would have known that the sources contained exactly all specified patients, then the mappings would have been exact, instead of sound. Suppose now that a user asks for the following queries: 1. Find the name and the SSN for all patients having a name and a SSN, that paid a bill and that were prescribed at least one cure. 2. Does the hospital offer dangerous treatments? As usual in DIS, our goal is to find the certain answers, e.g. the answers that are returned for all data trees that satisfy the global schema and conform to the data at the sources. By adapting data integration terminology introduced in Chapter 1 we call them legal data trees. A crucial point here is that knowledge about legal data trees may be obtained by merging the source trees. An important issue is thus to identify nodes from different sources that correspond to the same entity of the real world, a process sometimes called entity resolution [19], or reference reconciliation [38]. In practice, entity resolution is typically based on machine learning. We abstract this part of the problem here by assuming that the identification of nodes from different sources, so the merging of the source trees, is based on constraints, and more precisely key constraints. One can think of these keys as being added by a separate entity resolution module. Note, however, that data retrieved may not satisfy these constraints. In particular, there are two kinds of constraints violation. Data may be incomplete, e.g. it may violate constraints by not providing all data required according to the schema. Or, data retrieved may be inconsistent, i.e. it may violate constraints by providing two elements that are ”semantically” the same but cannot be merged without violating key constraints. In this paper, we address the problem
of answering queries in the presence of incomplete data, while we will assume that data does not violate the constraints. Coming back to the example, one can verify that the sources are consistent. The specification of the global schema constraints allows us to answer Query 1 by returning the patient with name “Parker” and social security number “55577”, since, thanks to the key constraint, we know that there cannot be two patients with the same SSN. Note that Query 2 can also be answered with certainty. Mappings actually let us infer that the patient named “Parker” was prescribed a dangerous cure. In addition, thanks to the foreign key constraint, we know that every cure that is prescribed to some patient is provided by the hospital. We conclude the example by highlighting the impact of the assumption of having sound/exact mappings. Suppose that no constraints were expressed over the global schema. Under the exact mapping assumption, by inspecting the data sources, it is possible to conclude that there is only one way to merge the data sources and satisfy the schema. Indeed, since every patient has a name and an SSN, we can deduce that all patients in D2 with an SSN lower than 100000 also belong to D1 . Therefore the answer to Query 1 would be the same as in the presence of constraints, whereas no answer would be returned to Query 2, since no information is given on that portion of the global schema. On the other hand, under the assumption of sound mappings, since in the absence of constraints there could be two patients with the same SSN, both queries would return empty answers.
The main contributions of this part of the thesis are as follows.
❒ First, following the logical approach presented in Section 1.1, we propose a formal framework for XML data integration systems based on (i) a global schema specified by means of a (simplified) DTD and a set of XML integrity constraints as defined in [42], (ii) a source schema specified by means of DTDs, and (iii) a set of LAV mappings specified by means of a prefix-selection query language inspired by the query language defined in [6].
❒ Second, we define the notion of identification function, and provide one such function that aims at globally identifying nodes coming from different sources. As already mentioned, the need for the introduction of identification is motivated by the concept of node identity.
❒ Third, we study the decidability of XML DIS consistency and its complexity under different assumptions on the mappings.
❒ Finally, we address the query answering problem in the XML data integration setting. In particular, given the strong connection with query answering over incomplete information, we propose an approach that is reminiscent of that context. We provide two polynomial algorithms to answer queries under different assumptions, and study the complexity of general XML DIS query answering.
This part of the thesis is an expanded and updated version of a DBPL conference paper [80]. It is organized as follows. Below, we start by discussing related work. In Chapter 7, we introduce the setting. In particular, we present the data model, the schema language and the query language used in this part. Then,
the logical framework for XML data integration is introduced in Chapter 8, where we also define the notion of identification function, and provide one particular such function. Finally, in Chapter 9, we investigate query answering, study its complexity, and propose different algorithms to answer queries under the assumptions of sound, exact, and mixed mappings.
Chapter 7
The setting In this chapter, we introduce preliminary definitions and propositions that we use all along this chapter. In particular, we start by presenting the data model for XML documents, and some properties of the model. Then, we define types for data, corresponding to simplified DTDs. We also introduce XML constraints that, together with types, form the schema language. Finally, we present the query language, that is an extension of the one introduced in [6].
7.1 Data model In this work, XML documents are represented as labeled unordered trees, called data trees, formally defined as follows. Definition 7.1.1 Let N be a set of node identifiers, Σ a finite set of element names ¯ ∪ {⊥} a domain for the data values, where the symbol ⊥ is a (labels), and Γ = Γ special data value that represents the empty value. A (data) tree T over Σ and Γ is a triple T = ht, λ, νi, where: ❒ t is a finite rooted tree (possibly empty) with nodes from N ; ❒ λ, called the labeling function, associates a label in Σ to each node in t; and ❒ ν, the data mapping, assigns a value in Γ to each node in t. The number of nodes in a data tree T is denoted |T |, whereas the depth of t is denoted d(T ). We call datanodes those nodes n of t such that ν(n) 6= ⊥. Example 7.1.2 In Fig. 7.1, we show three different data trees containing information about wards and patients admitted in an hospital. Note that only data values different from ⊥ are represented and are circled. Therefore, datanodes can be easily distinguished. We next introduce the notions of subsumption and equivalence. Intuitively, a data tree is subsumed by another tree, if all the information it contains may also be found in the other tree. And, two data trees are equivalent if they hold the same information content (up to replication). Indeed, two equivalent trees will be indistinguishable with the positive query language that we will consider. 103
[Figure 7.1: Data Model. (a) Data tree T1, (b) data tree T2, (c) data tree T3. The trees have a hospital root with ward children (data values "Geriatric" and "Psychiatric"), each with admitted children carrying adminID datanodes (values 55577, 14176, 29767).]
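For concreteness, the following sketch shows one possible in-memory encoding of data trees in the sense of Definition 7.1.1. It is not part of the thesis: the nested-dictionary representation, the helper node(), and the example tree (a rough reconstruction in the spirit of T1 of Fig. 7.1(a)) are illustrative assumptions.

```python
# Illustrative encoding of data trees (Definition 7.1.1): a node is a dictionary
# with a label, a data value (None plays the role of the empty value ⊥), and a
# list of children. Node identifiers are left implicit (Python object identity).

def node(label, value=None, children=None):
    """Build a data-tree node; datanodes are exactly the nodes whose value is not None."""
    return {"label": label, "value": value, "children": children or []}

# A tree in the spirit of T1 of Fig. 7.1(a): wards with admitted patients.
T1 = node("hospital", children=[
    node("ward", "Geriatric", [
        node("admitted", children=[node("adminID", "55577")]),
        node("admitted", children=[node("adminID", "14176")]),
    ]),
    node("ward", "Psychiatric", [
        node("admitted", children=[node("adminID", "29767")]),
    ]),
])

def size(t):
    """|T|: the number of nodes of the tree."""
    return 1 + sum(size(c) for c in t["children"])

def depth(t):
    """d(T): the depth of the tree (a single node has depth 0 in this sketch)."""
    return 0 if not t["children"] else 1 + max(depth(c) for c in t["children"])
```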
Homomorphism, subsumption, equivalence

We next define two notions that are crucial for this study.

Definition 7.1.3 Let T = ⟨t, λ, ν⟩ and T′ = ⟨t′, λ′, ν′⟩ be two data trees and h a function from the nodes of t to the nodes of t′. We say that h is a homomorphism from t to t′ if and only if h is a total function from the nodes of t to (a subset of) the nodes of t′ such that, for all nodes n, n′:
❒ if n is the root of t, then h(n) is the root of t′; and, if n is a child of n′, then h(n) is a child of h(n′) in t′; we therefore say that h preserves the parent-child relationship;
❒ λ′(h(n)) = λ(n); we say that h preserves the labeling;
❒ either ν(n) = ⊥ or ν(h(n)) = ν(n); we say that h preserves data.

Definition 7.1.4 Let T = ⟨t, λ, ν⟩ and T′ = ⟨t′, λ′, ν′⟩ be two data trees. We say that T is subsumed by T′, written T ≤ T′, if and only if there exists a homomorphism from t to t′. Moreover, we say that T is equivalent to T′, written T ≃ T′, if and only if T ≤ T′ and T′ ≤ T.

Note that, according to the above definition, the empty tree, i.e., the tree that does not contain any node, denoted T∅, is subsumed by all data trees.
Example 7.1.5 Let T1 = ⟨t1, λ1, ν1⟩, T2 = ⟨t2, λ2, ν2⟩ and T3 = ⟨t3, λ3, ν3⟩ be the data trees shown in Fig. 7.1(a), 7.1(b) and 7.1(c). It is easy to see that T2 and T3 are both subsumed by T1. Moreover, T1 is not subsumed by T2, since there exists no homomorphism from t1 to t2. Finally, T1 is subsumed by T3, which means that T1 and T3 are equivalent.

The following lemma provides an immediate algorithm for checking subsumption:

Lemma 7.1.6 For all data trees T = ⟨t, λ, ν⟩ and T′ = ⟨t′, λ′, ν′⟩, T ≤ T′ if and only if either T is empty (1), or the two trees are such that:
❒ they have roots r, r′, respectively, with λ(r) = λ′(r′), and ν(r) = ⊥ or ν(r) = ν(r′) (2);
❒ each subtree of r is subsumed by some subtree of r′ (3).
Proof. "⇒": Suppose first that T ≤ T′. If T′ is the empty tree, then T is the empty tree, by definition of homomorphism. So, in this case, (1) holds. Suppose now that T′ is not empty. Clearly, the definition of homomorphism also implies (2). Now, by considering h on the subtrees of the root, one can easily prove (3).

"⇐": Let us now prove by induction on the depth k of T that:

(*) For all T, T′, if (1)-(3) hold for T, T′ and d(T) ≤ k, then there exists a homomorphism from T to T′, i.e., T ≤ T′.

The basis of the induction is obvious by (1). Now suppose that (*) holds for some k, and let T, T′ satisfy (1)-(3) with d(T) = k + 1. Let T1, ..., Tn be the distinct subtrees of the root of T. For each i, Ti is subsumed by some subtree of the root of T′. By the induction hypothesis, there exists a homomorphism hi from Ti to that subtree. Let h be the function that maps the root of T to that of T′ and coincides with hi on each Ti. One can verify that h is a homomorphism from T to T′. By induction, this shows that (*) holds for each k.

We now study the complexity of subsumption.

Proposition 7.1.7 Let T′ and T′′ be two data trees. One can check whether T′ ≤ T′′ in time O(|T′| ∗ |T′′|).
Proof. (sketch) Suppose, by induction, that there is a constant c such that, for each T′ of depth less than or equal to k and for each T′′, one can check T′ ≤ T′′ in time c|T′||T′′|. (To simplify, we ignore empty data trees.) Let T′ be a tree of depth k + 1. The main issue is the cost of comparing the subtrees T′_1, ..., T′_m of the root of T′ to the subtrees T′′_1, ..., T′′_l of the root of T′′. By induction, comparing T′_i to T′′_j can be performed in time c|T′_i||T′′_j|. The cost of comparing the subtrees is then

Σ_{i=1..m} Σ_{j=1..l} c · |T′_i| · |T′′_j| ≤ c · Σ_{i=1..m} (|T′_i| · Σ_{j=1..l} |T′′_j|) ≤ c · Σ_{i=1..m} |T′_i| · |T′′| ≤ c · |T′| · |T′′|.
This concludes the proof.
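The recursive characterization of Lemma 7.1.6 translates directly into code. The sketch below is an illustration under the nested-dictionary encoding assumed earlier (it is not the thesis' own pseudo-code); None stands for the empty tree, and each pair of nodes is compared at most once, in line with the quadratic bound of Proposition 7.1.7.

```python
# Subsumption test following Lemma 7.1.6: T <= T' iff the root labels agree,
# the root value of T is ⊥ (None) or equals that of T', and every subtree of
# the root of T is subsumed by some subtree of the root of T'.

def subsumed(t, t_prime):
    if t is None:                      # the empty tree is subsumed by every tree
        return True
    if t_prime is None:
        return False
    if t["label"] != t_prime["label"]:
        return False
    if t["value"] is not None and t["value"] != t_prime["value"]:
        return False
    return all(any(subsumed(c, c2) for c2 in t_prime["children"])
               for c in t["children"])

def equivalent(t, t_prime):
    """Equivalence (Definition 7.1.4) is mutual subsumption; cf. Corollary 7.1.8 for its cost."""
    return subsumed(t, t_prime) and subsumed(t_prime, t)
```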
As a direct consequence of the previous proposition and of the definition of equivalence of two data trees, we have the following:

Corollary 7.1.8 Let T′ and T′′ be two data trees. One can check whether T′ ≃ T′′ in time O(|T′| ∗ |T′′|).

To conclude with subsumption and equivalence, we observe the following properties:

Proposition 7.1.9 (i) Subsumption is transitive; (ii) equivalence is reflexive, symmetric and transitive, i.e., it is an equivalence relation.
Proof. (Subsumption) Let T′, T′′ and T′′′ be such that T′ ≤ T′′ and T′′ ≤ T′′′. Then, by definition, there exist homomorphisms h12 from T′ to T′′ and h23 from T′′ to T′′′. Let h13 be the function over the nodes of T′ defined by: for each node n, h13(n) = h23(h12(n)). It is easy to verify that h13 preserves the root, the labeling and the data. Therefore h13 is a homomorphism, and so T′ ≤ T′′′.

(Equivalence) Reflexivity and symmetry hold by definition. Transitivity follows from the transitivity of subsumption.
Tree prefixes, Minimality

Consider two equivalent data trees. Clearly, it may be the case that one of them contains a lot of replication whereas the other does not. In practice, one would prefer to use a "minimal" data tree. The lack of redundancy is captured by the following two definitions.

Definition 7.1.10 A data tree T′ = ⟨t′, λ′, ν′⟩ is a prefix of T = ⟨t, λ, ν⟩ if and only if:
❒ t′ is such that:
  – the root r′ of t′ is the root r of t;
  – every subtree of t′ rooted at a child of r′ is a prefix of a subtree of t rooted at a child of r;
❒ λ′ and ν′ are resp. the restrictions of λ and ν to the nodes of t′.

Clearly, we have the following lemma.

Lemma 7.1.11 For each T and each prefix T′ of T, we have that T′ ≤ T.

Definition 7.1.12 Let T be a data tree. We say that T is minimal if there is no prefix of T, other than T itself, that is equivalent to T.

Example 7.1.13 Let us consider again the data trees T1 = ⟨t1, λ1, ν1⟩ and T3 = ⟨t3, λ3, ν3⟩. One can see that T3 is not minimal, whereas T1 is.
Let T = ⟨t, λ, ν⟩ be a data tree. We will use the algorithm Minimal(T) that takes as input T and returns a tree by proceeding as follows:

1. minimize the subtrees of the root;
2. select (arbitrarily) one subtree of the root that is subsumed by another one and remove it, until there is no subsumed subtree.

We next see that this algorithm constructs a minimal tree that is equivalent to T, in quadratic time:

Proposition 7.1.14 Given a tree T, one can construct the data tree Minimal(T), which is equivalent to T and minimal, in PTIME with respect to the size of T.
Proof. (sketch) By construction and by Lemma 7.1.6, Minimal(T) is equivalent to T. Suppose that it is not minimal. Then, for some node n in the tree, some subtree would be redundant, a contradiction with the construction.

For the complexity, the proof is by induction on the number of nodes in the tree. Suppose that, for some constant c, the complexity of minimizing a data tree is c × |T|^2 for all trees of size less than n. Consider a tree T of size n. We have to minimize its subtrees T1, ..., Tk, which costs

Σ_{j=1..k} c × |Tj|^2 ≤ c × |T|^2.

Note that we also have to test equivalence between subtrees, but that is polynomial by Corollary 7.1.8.

We also have:

Proposition 7.1.15 For each equivalence class of data trees, there exists a minimal element that is unique up to isomorphism (i.e., up to renaming node ids).
Proof. The existence of a minimal tree follows from Proposition 7.1.14. For uniqueness, suppose there are two such minimal trees T, T′. Since T ≃ T′, there exist homomorphisms h from T to T′ and h′ from T′ to T. First suppose that h′(T′) = T. Then T and T′ are isomorphic. Now suppose that h′(T′) ⊂ T (strict subset). Then h′(h(T)) ⊆ h′(T′) ⊂ T, so some subtree of T is redundant, a contradiction with the minimality of T. Thus, h′(T′) ⊂ T is not possible, and T and T′ are isomorphic.

Based on the previous results, we assume without loss of generality that all the trees we consider from now on are minimal, unless explicitly stated otherwise.
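The following sketch spells out the Minimal(T) procedure described above in the nested-dictionary encoding assumed earlier; it reuses the subsumed() test sketched before and is only an illustration of the two steps, not code from the thesis.

```python
# Minimal(T): (1) minimize the subtrees of the root, then (2) repeatedly drop a
# subtree of the root that is subsumed by one of its siblings, until none is left.

def minimal(t):
    if t is None:                                    # the empty tree is already minimal
        return None
    children = [minimal(c) for c in t["children"]]   # step 1
    kept = []
    for i, c in enumerate(children):                 # step 2
        # drop c if it is subsumed by a sibling that is kept or still to be examined;
        # among several equivalent siblings, exactly one survives this loop
        if not any(subsumed(c, other) for other in kept + children[i + 1:]):
            kept.append(c)
    return {"label": t["label"], "value": t["value"], "children": kept}
```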
Intersection

To conclude the presentation of the data model, we consider a last notion, namely intersection.

Definition 7.1.16 Let T′ and T′′ be two data trees. The intersection of T′ and T′′, denoted T′ ∩ T′′, is the largest tree (with respect to subsumption) that is subsumed by both, i.e., it is a tree T such that: (i) T ≤ T′ and T ≤ T′′, and (ii) for each T̄, if T̄ ≤ T′ and T̄ ≤ T′′, then T̄ ≤ T.
We will see that for each pair of data trees, their intersection always exists and is unique up to equivalence.

Example 7.1.17 Let us consider the data trees T1 and T2, resp. in Fig. 7.2(a) and 7.2(b). They contain data about patients and treatments of a hospital. In Fig. 7.2(c) we show the intersection T1 ∩ T2. One can verify that there exists no tree that is subsumed by both T1 and T2 and is not subsumed by T1 ∩ T2.

[Figure 7.2: Data Model. (a) Data tree T1, (b) data tree T2, (c) T3 = T1 ∩ T2. The trees describe hospital patients (SSN, name, cure, bill) and treatments (trID).]

Let T′ = ⟨t′, λ′, ν′⟩ and T′′ = ⟨t′′, λ′′, ν′′⟩ be data trees with roots r′ and r′′, respectively. We next show that their intersection is constructed by the recursive function Intersection(T′, T′′) as follows:
❒ If λ′(r′) ≠ λ′′(r′′), then T∩ = T∅.
❒ Otherwise, T∩ = ⟨t∩, λ∩, ν∩⟩, where:
  – the root of t∩ is a new node that inherits the label of the two roots; moreover, if both roots are datanodes having the same value, the root of t∩ inherits their value, otherwise it takes the value ⊥;
  – the subtrees of the root of T∩ are the set of trees
    {Intersection(T′_s, T′′_s) | T′_s a subtree of the root of T′, T′′_s a subtree of the root of T′′}.

Note that the function above does not return a minimal tree. However, it is immediate to build from the returned data tree the minimal tree that is equivalent to it, by simply applying the algorithm Minimal(T∩) defined previously. We have the following result:
Proposition 7.1.18 Given two data trees T′ and T′′, Intersection(T′, T′′) is an intersection of T′ and T′′, and can be computed in quadratic time.
Proof. To show that T∩ = Intersection(T′, T′′) is an intersection of T′ and T′′, we have to prove that T∩ satisfies the two properties (i) and (ii) of the definition of intersection of T′ and T′′. By construction, T∩ clearly satisfies (i). Let us now consider (ii). Let T̄ = ⟨t̄, λ̄, ν̄⟩ be such that T̄ ≤ T′ and T̄ ≤ T′′; we show that T̄ ≤ T∩. Since T̄ ≤ T′ and T̄ ≤ T′′, there exist two homomorphisms h1, h2 from T̄ to T′, T′′ respectively. Let h be the function from t̄ to t∩ recursively defined as follows:

❒ h(r̄) = r∩, where r̄ is the root of t̄ and r∩ is the root of t∩. Note that h preserves the parent-child relationship for r̄, since r̄ and r∩ are both roots. Moreover, since T̄ ≤ T′ and T̄ ≤ T′′, we have λ̄(r̄) = λ′(r′) = λ′′(r′′), where r′, r′′ are resp. the roots of t′, t′′. But then, from the construction of T∩, we have that λ̄(r̄) = λ∩(r∩), which means that h preserves the label of r̄. Similarly, if ν̄(r̄) ≠ ⊥ then we must have ν̄(r̄) = ν′(r′) = ν′′(r′′), and then ν̄(r̄) = ν∩(r∩). On the contrary, if at least one among r′, r′′ is not a datanode, then ν̄(r̄) = ⊥. Therefore, h preserves the data mapping of r̄.

❒ For every child n̄ of r̄, let t̄′ be the subtree of t̄ rooted at n̄. Since T̄ ≤ T′, h1 maps t̄′ into t′_s, where t′_s is a subtree of t′ rooted at a child n′ of r′. Similarly, since T̄ ≤ T′′, h2 maps t̄′ into t′′_s, where t′′_s is a subtree of t′′ rooted at a child n′′ of r′′. Then Intersection(T′_s, T′′_s) ≠ T∅, where T′_s, T′′_s are the restrictions of T′, T′′ to the nodes of t′_s, t′′_s respectively. We can therefore recursively define h so that it maps t̄′ into the subtree t_s of r∩ with t_s = Intersection(T′_s, T′′_s); in particular, h(n̄) is the root of t_s, which proves that h preserves the parent-child relationship.

From the previous construction, it is clear that h is a homomorphism from t̄ to t∩, and therefore T̄ ≤ T∩.

In order to prove that Intersection(T′, T′′) runs in time O(N′ ∗ N′′), where N′, N′′ are resp. the numbers of nodes of T′, T′′, we would again proceed by induction. We omit the details, since the proof is very similar to that of the complexity of checking subsumption (cf. the proof of Proposition 7.1.7).
Proposition 7.1.19 Given two data trees, their intersection always exists and is unique up to tree equivalence.
Proof. The existence of an intersection of two data trees follows directly from the previous proposition. To show uniqueness, let T1∩ and T2∩ be two intersections of T′, T′′. By property (i) of the definition applied to T1∩, we have T1∩ ≤ T′ and T1∩ ≤ T′′. By property (ii) applied to T2∩, it follows that T1∩ ≤ T2∩. By symmetry, T2∩ ≤ T1∩, so T1∩ and T2∩ are equivalent.
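The recursive Intersection construction above can be sketched as follows, again in the assumed nested-dictionary encoding and reusing the minimal() procedure from the earlier sketch; it is an illustration, not the thesis' own algorithm.

```python
# Intersection(T', T''): empty if the root labels differ; otherwise a new root
# that keeps the common label, keeps the common data value (if any), and has as
# children the pairwise intersections of the subtrees of the two roots.

def intersection(t1, t2):
    if t1 is None or t2 is None or t1["label"] != t2["label"]:
        return None                                  # the empty tree T∅
    value = t1["value"] if (t1["value"] is not None
                            and t1["value"] == t2["value"]) else None
    children = [sub
                for c1 in t1["children"]
                for c2 in t2["children"]
                if (sub := intersection(c1, c2)) is not None]
    return {"label": t1["label"], "value": value, "children": children}

# As noted in the text, the result need not be minimal; a minimal representative
# is obtained as minimal(intersection(t1, t2)).
```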
[Figure 7.3: Example of a tree type.]
7.2 Tree Type

Let Σ be an alphabet. A tree type over Σ is a simplified version of DTDs that can be represented as a triple ⟨Σ, r, µ⟩, where Σ is a set of labels, r ∈ Σ is a special label denoting the root, and µ associates to each label a ∈ Σ a multiplicity atom µ(a) representing the type of a, i.e., the set of labels allowed for the children of nodes labeled a, together with some multiplicity constraints. More precisely, µ(a) is an expression a1^ω1 ... ak^ωk, where the ai are distinct labels in Σ and ωi ∈ {∗, +, ?, 1}, for i = 1, ..., k.

We say that a data tree T over Σ satisfies a tree type S = ⟨Σ, r, µ⟩, noted T |= S, if and only if: (i) the root of T has label r, and (ii) for every node n of T such that λ(n) = a, if µ(a) = a1^ω1 ... ak^ωk, then all the children of n have labels in {a1, ..., ak}, and the number of children labeled ai is restricted as follows:
❒ if ωi = 1, then exactly one child of n is labeled with ai;
❒ if ωi = ?, then at most one child of n is labeled with ai;
❒ if ωi = +, then at least one child of n is labeled with ai;
❒ if ωi = ∗, then no restriction is imposed on the children of n labeled with ai.

(One could also consider allowing a fixed number of children labeled ai; to simplify, this possibility is ignored here.)

Given a tree type, we call a label a a collection of elements ai if there is an occurrence of either ai^∗ or ai^+ in µ(a), for some ai ∈ Σ. Moreover, ai is then called a member of the collection a.

Example 7.2.1 Consider the DTD SG from Section III. SG corresponds to the tree type ⟨Σ, r, µ⟩ such that r = hospital and µ can be specified as follows:

µ(hospital) = patient+ treatment+
µ(patient) = SSN name cure∗ bill∗
µ(treatment) = trID procedure?

In Fig. 7.3 we show a graphical representation of SG. Note that patient and treatment are both members of the collection hospital.
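To make the satisfaction conditions concrete, here is a small sketch of a checker for T |= S in the encoding assumed earlier; the (label, multiplicity) representation of µ and the function names are illustrative assumptions, and the dictionary mu_G below transcribes the tree type of Example 7.2.1.

```python
# Checking T |= S for a tree type S = (Sigma, r, mu): the root must be labeled r,
# and at every node the children must respect the multiplicity atoms of mu.
from collections import Counter

def satisfies_type(t, root_label, mu):
    if t is None or t["label"] != root_label:
        return False
    return _check_node(t, mu)

def _check_node(n, mu):
    atoms = mu.get(n["label"], [])                   # list of (child label, multiplicity)
    allowed = {label for label, _ in atoms}
    counts = Counter(c["label"] for c in n["children"])
    if set(counts) - allowed:                        # a child label not allowed by mu
        return False
    for label, mult in atoms:
        k = counts[label]
        if mult == "1" and k != 1: return False      # exactly one
        if mult == "?" and k > 1:  return False      # at most one
        if mult == "+" and k < 1:  return False      # at least one
        # mult == "*": no restriction
    return all(_check_node(c, mu) for c in n["children"])

# The tree type S_G of Example 7.2.1 in this encoding:
mu_G = {
    "hospital":  [("patient", "+"), ("treatment", "+")],
    "patient":   [("SSN", "1"), ("name", "1"), ("cure", "*"), ("bill", "*")],
    "treatment": [("trID", "1"), ("procedure", "?")],
}
```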
7.3 Constraints and schema language

We next recall and adapt to our setting the definition of XML constraints from [42, 22], and introduce our schema language. Let S be a tree type over an alphabet Σ.

Unary keys (UK) are assertions of the form a.k → a, where a ∈ Σ and k^1 ∈ µ(a) (i.e., k occurs in µ(a) with multiplicity 1). We then say that k is a key for a. The semantics of keys is the following. Given a tree T satisfying S, T |= a.k → a if and only if:
❒ each node labeled a has a single child labeled k, and this child is a datanode;
❒ for any two distinct nodes labeled a, their respective children labeled k have distinct data values.

Example 7.3.1 Consider the tree type SG in Fig. 7.3. In order to constrain every data tree satisfying SG so that there do not exist two distinct nodes labeled patient having the same SSN, we use the following UK:

patient.SSN → patient

Note that the above UK is satisfied by the data tree in Fig. 7.2(a), whereas it is not satisfied by the data tree in Fig. 7.2(b).

Foreign keys (FK) are assertions of the form a.h ⊆ b.k, where k is a key for b, a ∈ Σ and h^ω ∈ µ(a) for some ω ∈ {1, ?, +, ∗}. The semantics of foreign keys is the following. Let T be a tree satisfying S. Then T |= a.h ⊆ b.k if and only if, for every datanode m labeled h that is a child of a node n labeled a, there exists a node n′ labeled b having a single child m′ labeled k with the same data value as m.

A foreign key a.h ⊆ b.k may be seen as introducing, in nodes labeled a, references to nodes labeled b. Now, by definition, nodes labeled b may occur "anywhere" in the document. Even though it is possible to design documents in that manner, it seems much more natural to group all b's in a single place of the document (as is often done in practice). This motivates the following definition of uniquely localizable foreign keys; the general case of arbitrary foreign keys is more complicated and left for future research.

For a tree type S, we call a foreign key a.h ⊆ b.k uniquely localizable (ULFK for short) if there exists a unique path r, l1, ..., ls, b such that: (i) for each document of tree type S and for each node labeled b in this document, the sequence of labels from the root to that node is r, l1, ..., ls, b, and (ii) no li on this path is a member of a collection, for i ∈ {1, ..., s}. It is easy to see that, as a consequence, in each document satisfying S the elements labeled b are the children of a unique node.
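The semantics of unary keys and foreign keys just given can be checked directly on a data tree. The sketch below is an illustration in the assumed encoding (the helper names are hypothetical, not from the thesis); it does not address uniquely localizable foreign keys, which are a restriction on the tree type rather than on the data.

```python
# satisfies_key:  T |= a.k -> a   (every a-node has exactly one k datanode child,
#                                  and no two a-nodes share that k value)
# satisfies_fk:   T |= a.h ⊆ b.k  (every h datanode value under an a-node occurs
#                                  as the k value of some b-node)

def all_nodes(t):
    yield t
    for c in t["children"]:
        yield from all_nodes(c)

def satisfies_key(t, a, k):
    seen = set()
    for n in all_nodes(t):
        if n["label"] != a:
            continue
        key_children = [c for c in n["children"] if c["label"] == k]
        if len(key_children) != 1 or key_children[0]["value"] is None:
            return False                 # missing, duplicated, or non-data key child
        if key_children[0]["value"] in seen:
            return False                 # two distinct a-nodes with the same key value
        seen.add(key_children[0]["value"])
    return True

def satisfies_fk(t, a, h, b, k):
    referenced = {c["value"]
                  for n in all_nodes(t) if n["label"] == b
                  for c in n["children"] if c["label"] == k and c["value"] is not None}
    return all(c["value"] in referenced
               for n in all_nodes(t) if n["label"] == a
               for c in n["children"] if c["label"] == h and c["value"] is not None)
```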
[Figure 7.4: Another example of tree type (a hospital type with patient, treatment, and ward/admitted/admID elements).]
Example 7.3.2 Consider the tree type shown in Fig. 7.4 and the following foreign keys:

patient.cure ⊆ treatment.trID
patient.SSN ⊆ admitted.admID

where trID and admID are keys for the elements labeled treatment and admitted, respectively. The first assertion specifies the constraint of Section III, i.e., it specifies that whenever a cure has been prescribed to a patient, its identifier must appear among the identifiers of the treatments offered by the hospital. The second assertion specifies that whenever a patient's SSN appears among the patients of the hospital, then that patient was admitted in some hospital ward. Note that the first foreign key is a ULFK, whereas the second is not, since ward is a member of the collection hospital, i.e., ward^∗ ∈ µ(hospital), and ward is on the label path from the root to the elements labeled admitted, which are referenced by the foreign key.

Finally, let SG be a tree type, ΦK a set of keys and ΦFK a set of foreign keys. A schema is a triple G = ⟨SG, ΦK, ΦFK⟩. Moreover, we say that a tree T strongly satisfies the schema G if and only if T |= SG, T |= ΦK and T |= ΦFK. On the other hand, we say that T weakly satisfies the schema G = ⟨SG, ΦK, ΦFK⟩, written T |=w G, if and only if there exists T′ such that T ≤ T′ and T′ satisfies G. Intuitively, this means that T may be incomplete w.r.t. G but not inconsistent. Clearly, if T satisfies G then T weakly satisfies G, whereas the reverse does not hold. From now on, when we talk about satisfaction, we mean strong satisfaction, unless differently specified.

Example 7.3.3 Let us come back to the example illustrated in Section III. The hospital data that we want to represent is such that it satisfies a schema G = ⟨SG, ΦK, ΦFK⟩, where SG is the tree type in Fig. 7.3, and ΦK and ΦFK are resp. the following sets of key constraints and foreign key constraints:

ΦK = {patient.SSN → patient; treatment.trID → treatment}
ΦFK = {patient.cure ⊆ treatment.trID}
Clearly, the tree T1 in Fig. 7.2(a) satisfies the schema G = ⟨SG, ΦK, ΦFK⟩, whereas the tree T2 in Fig. 7.2(b) does not, since (i) the first key constraint is violated (the data tree contains two patients with the same SSN), and (ii) the foreign key is violated
(the cure with id 25 does not appear among the treatment ids). Finally, the tree T3 of Fig. 7.2(c) is an example of a tree that weakly satisfies G but does not (strongly) satisfy it, since the patient node corresponding to the patient named Rossi has no child datanode labeled SSN. Indeed, T3 is subsumed by T1, which satisfies G.
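Putting the sketches above together, the following snippet checks (strong) satisfaction of a schema in the style of Example 7.3.3; the tree is only a rough, partial reconstruction of T1 in Fig. 7.2(a), and node(), mu_G, satisfies_type, satisfies_key and satisfies_fk are the illustrative helpers introduced earlier, not code from the thesis.

```python
# A small hospital document with one patient and one treatment.
doc = node("hospital", children=[
    node("patient", children=[
        node("SSN", "55577"), node("name", "Parker"), node("cure", "32")]),
    node("treatment", children=[node("trID", "32")]),
])

# Strong satisfaction of G = <S_G, Phi_K, Phi_FK>: the tree type plus every
# key and foreign key constraint of the schema must hold.
strongly_satisfies = (
    satisfies_type(doc, "hospital", mu_G)
    and satisfies_key(doc, "patient", "SSN")
    and satisfies_key(doc, "treatment", "trID")
    and satisfies_fk(doc, "patient", "cure", "treatment", "trID")
)
print(strongly_satisfies)   # expected: True
```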
7.4 Prefix Queries

We now introduce the prefix query language that we use throughout this work. It is an extension of the prefix-selection queries presented in [6]. Intuitively, prefix queries (p-queries for short) browse the input tree starting from the root and going down to a certain depth, traversing nodes with specified labels and with data values satisfying specified conditions. Whereas boolean p-queries check for the existence of a certain tree pattern in T, general p-queries return a tree that is equivalent to a "prefix projection" of the nodes selected by the query.

We are now able to formally define p-queries. A p-query q over an alphabet Σ is a quadruple ⟨tq, λq, condq, retq⟩ where:
❒ tq is a rooted tree;
❒ λq associates to each node a label in Σ, where sibling nodes have distinct labels;
❒ condq is a total function that associates to each node of tq a boolean formula, called condition, having either the form ⊤, which evaluates to true for all possible values in Γ, or the form p0 b0 p1 b1 ... pm−1 bm−1 pm, where the pi are predicates and the bj are boolean operators, such that (i) the pi can be applied to values in Γ̄ (for instance, if Γ̄ = Q, pi can have the form op v, where op ∈ {=, ≠, <, ≤, >, ≥} and v ∈ Q), and (ii) the pi return false when applied to ⊥;
❒ retq (for "returned by q") is a total function that assigns to each node nq of tq a boolean value such that: (i) retq(nq) = true if nq is the root of tq, and (ii) if retq(nq) = false, then retq(n′q) = false for every child n′q of nq.

By analogy with data trees, we denote by d(q) the depth of tq. Let q = ⟨tq, λq, condq, retq⟩ be a p-query. If there is at least one node nq ∈ tq such that retq(nq) = false and the parent pq of nq is such that retq(pq) = true, then we say that q contains an existential subtree pattern rooted at nq. Moreover, we say that q is a boolean p-query if retq(nq) = true only for the root of tq.

We next formalize the notion of answer to a p-query using the auxiliary concept of a valuation. Given a p-query q = ⟨tq, λq, condq, retq⟩ and a data tree T = ⟨t, λ, ν⟩, a valuation γ from q to T is a total function from the nodes of tq to the nodes of t preserving the parent-child relationship and the labeling, and such that, for each nq ∈ tq, ν(γ(nq)) satisfies condq(nq). Observe that γ(q) is a prefix of t. We call the image of q posed over T, denoted Image(q, T), the tree ⟨ti, λi, νi⟩ such that:
❒ ti consists of all the nodes of T that are in γ(q) for some valuation γ from q to T;
❒ λi and νi are resp. the restrictions of λ and ν to the nodes in ti.
Similarly, we call the answer to q over T the tree q(T) = ⟨tA, λA, νA⟩ such that:
❒ tA consists of all the nodes n of T such that γ(n0) = n for some valuation γ from q to T and some n0 ∈ tq with retq(n0) = true;
❒ λA and νA are resp. the restrictions of λ and ν to the nodes in tA.

Clearly, by construction, Image(q, T) and q(T) are both prefixes of T, and q(T) is a prefix of Image(q, T). Intuitively, Image(q, T) represents the prefix of T whose nodes are selected by q, whereas q(T) is the prefix of Image(q, T) whose nodes are returned by q. Thus, by construction and by Lemma 7.1.11, we have the following.

Lemma 7.4.1 Given a p-query q and a data tree T over Σ, q(T) is unique (up to tree equivalence). Moreover, q(T) ≤ Image(q, T) ≤ T.

Observe the following. Let q be a boolean p-query. Then either there exists no valuation from q to T, and therefore Image(q, T) = T∅, or Image(q, T) = ⟨tr, λr, νr⟩, where tr is a tree containing only the root r, having the same label and data value as the root of T. Suppose that the first case occurs. Then q(T) = T∅ and the answer to q over T is T∅; this means that T does not satisfy q. Suppose now that Image(q, T) = ⟨tr, λr, νr⟩. Then q(T) = ⟨tr, λr, νr⟩, which means that T satisfies q. The tree containing only the root is therefore equivalent to true, whereas the empty tree T∅ is equivalent to false. Note that this is in the same spirit as the relational model, where a boolean query returns the empty set (∅) when it evaluates to false, and returns the set containing the empty tuple ({()}) when it evaluates to true.

Example 7.4.2 In Fig. 7.5 we show several p-queries. We graphically represent an existential subtree pattern in a query by underlining the label of its root. Moreover, only conditions different from ⊤ are represented. In particular, Fig. 7.5(a) shows a boolean query asking whether there are patients that were admitted in the ward "Geriatric". Posed over the data tree in Fig. 7.1(a), this query returns true. Consider now the queries in Fig. 7.5(b) and 7.5(d). They select respectively (i) the name and the SSN of patients having an SSN smaller than 100000, and (ii) the SSN of patients that were prescribed at least one dangerous cure (i.e., a cure with id lower than 35), together with the bills they paid. The answers to these last two queries, when they are posed over the tree of Fig. 7.2(a), are given resp. in Fig. 7.5(c) and 7.5(e).

Clearly, by the definition of p-queries, we have the following:

Proposition 7.4.3 p-queries are monotone, i.e., for every p-query q and every two data trees T′, T′′, if T′ ≤ T′′ then q(T′) ≤ q(T′′).
[Figure 7.5: The p-queries and answers of Example 7.4.2, panels (a)–(e).]
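As an illustration of the semantics of p-queries (valuations, images and answers), here is a small evaluation sketch in the encoding assumed earlier. The query representation (a label, a condition on the data value, a ret flag, and children) and the helper names are assumptions of this sketch, not the thesis' own algorithms; the returned answer may be further reduced with the minimal() procedure sketched in Section 7.1.

```python
# A p-query node: {"label": str, "cond": predicate on the data value (None = ⊥),
#                  "ret": bool, "children": [...]}.  A condition of the form ⊤ is
# written as lambda v: True; other conditions must return False on None (⊥).

def has_valuation(q, t):
    """Is there a valuation of the p-query rooted at q into the data tree rooted
    at t that maps the root of q to the root of t?"""
    if q["label"] != t["label"] or not q["cond"](t["value"]):
        return False
    return all(any(has_valuation(qc, tc) for tc in t["children"])
               for qc in q["children"])

def answer(q, t):
    """The answer q(T) as a data tree; None encodes the empty tree T∅."""
    if not has_valuation(q, t):
        return None
    kids = []
    for qc in q["children"]:
        if not qc["ret"]:
            continue          # existential subtree pattern: checked above, not returned
        for tc in t["children"]:
            sub = answer(qc, tc)
            if sub is not None:
                kids.append(sub)
    return {"label": t["label"], "value": t["value"], "children": kids}

# A query in the spirit of Fig. 7.5(b): name and SSN of patients with SSN < 100000.
TOP = lambda v: True
q_b = {"label": "hospital", "cond": TOP, "ret": True, "children": [
    {"label": "patient", "cond": TOP, "ret": True, "children": [
        {"label": "SSN", "cond": lambda v: v is not None and int(v) < 100000,
         "ret": True, "children": []},
        {"label": "name", "cond": TOP, "ret": True, "children": []},
    ]}]}
# answer(q_b, doc) would return the corresponding prefix of a data tree 'doc'.
```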