Answering queries across mappings - FORTH-ICS

1 downloads 0 Views 257KB Size Report
Jul 5, 2005 - [2] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. ... [4] A. K. Chandra and P. M. Merlin. Optimal implementation of conjunctive ...
Answering queries across mappings Grigoris Karvounarakis Department of Computer and Information Science University of Pennsylvania 3330 Walnut St., Philadelphia, PA 19104 [email protected] July 5, 2005

Abstract A common situation in data management involves the existence of related data (e.g., results of biological experiments) that are organized according to different schemas. Data integration aims to provide uniform access to such heterogeneous data sources, by taking advantage of mappings between the source schemas and virtual (often mediated) schemas, which express the relationships between different concepts. Such mappings are often defined as containment constraints between queries or by logical assertions in the form of dependencies. In this context, the central technical issue is the ability to answer queries over virtual schemas, using the data of the sources and the mappings. One approach to this problem is query reformulation, which produces a rewriting of the original query that can be evaluated over the data sources. The inverse rules technique is a fundamental tool that is used to compute a solution to this problem in different settings. On the other hand, data exchange employs the chase technique in order to compute an instance for the virtual schema, using the mappings and data in the sources. This resulting instance can be directly queried to evaluate queries over the target schema directly on the data exchange solution. In this paper we compare the query reformulation and data exchange approaches for query answering, for an interesting class of mappings and queries. Moreover, we illustrate extensions of the basic problem setting, in the form of Peer Data Management Systems, in which the query reformulation has been applied. Finally, we attempt a closer investigation of the inverse rules and the chase techniques, that reveals a tight relationship between them.

1

Introduction

The existence of multiple different schemas describing related data is a common (and often desirable) phenomenon in data management. In a simple scenario, views are often defined over database schemas in order to provide different levels of access to users with different priviledges. In a distributed setting, members of the same community (e.g., biologists) often use several schemas to describe their data (e.g., because they cannot agree on a common 1

schema or because there is a large amount of existing data that are described according to those schemas). However, none of these users has all of the related data in their local store, organized under their own schema (and some do not have any local data at all) At the same time, they all want uniform access to all the knowledge of the community, e.g., in order to evaluate some query. In order to facilitate some reconciliation between the different schemas, they specify mappings that express relationships between the different concepts, between their schemas and either some commonly agreed “mediated” schema (in a two-tier architecture), or the schemas of some of other peers directly (in a peer-to-peer setting). However, mediated schemas are virtual, in that there are no real data sources that are described according to them. In the P2P setting, some of the peer schemas may be virtual while others describe data sources, but again, no peer is expected to have all relevant data stored in its local data store; nevertheless, users would like to have uniform access to the contents of all data sources. In both cases, the object of data integration [17] is to use these mappings in order to provide uniform access to such heterogeneous data sources; in order to achieve this, the central technical issue is the ability to answer queries across mappings. Given some heterogeneous data sources, source and virtual schemas, mappings between the schemas and a query over one of the virtual schemas, one would like to compute all “relevant” answers to the query using information from all data sources. In all these situations, the basic “building block” of the problem setting is the following: there are two schemas, S and T , called the source and target schemas, as well as a mapping M between them. Then, given an instance I of S, the goal of query answering is to evaluate a query Q over T , using the data in I. As we are going to illustrate later, the mediated setting considered above can be accommodated by considering S to be the union of the source schemas and T to be the mediated schema. There are two main approaches in terms of languages used to express schema mappings: the first uses containment between queries over S and T to specify mappings between them in its most general form (GLAV [11]). Papers following this approach have focused on several special cases of practical interest, (GAV [13]/LAV [18]), by considering one of S and T as a view schema of the other [15, 17]. In this context, query answering is similar with the problems of query optimization or providing physical data independence under materialized views. Another approach employs logical assertions (or dependencies) on the schemas. This approach also has its roots in query optimization under materialized views and integrity constraints [6]. Interestingly, the most general forms of the two mapping languages (GLAV vs. tgds, explained in the next section) are equivalent. Because of the use of views in mappings and also of its relation with other problems involving views, query answering across mappings has often appeared in the database literature under the name “answering queries using views” [15], mostly in papers that use containment between views as their mapping language. A very general formulation of this problem is: given a query Q over a schema T and a set of view definitions V1 , . . . , Vn over the same schema, is it possible to answer Q using only the answers to the views (i.e., without any knowledge of the contents of the actual database relations, which may be virtual, as explained above). Adapted to the context of data integration, given a query Q over a (virtual) schema T and some mappings between T and a set of data sources S1 , . . . , Sn , query answering is the 2

problem of answering Q using data from these sources. Note that query answering depends on the contents of the sources and does not pose any limitation on how the answers should be computed. For instance, one could even imagine using a different algorithm to compute the answers for different source instances. In its general form, this problem has been considered mostly in order to prove theoretical decidability and complexity results, as e.g., in [1], rather than in practical applications. Instead, most of the research in the context of answering queries using views for data integration has followed earlier work on query optimization in focusing on a particular way to solve the query answering problem, namely query reformulation or rewriting: given a query Q over a schema T , a source schema(s) S and a mapping M between them, is there a query Q0 over S that returns all the “relevant” answers of Q? More recently, a different approach has been proposed in the context of data exchange [9]. The data exchange problem is the following: given the mapping M and an instance I of S, is there an instance J of T , so that I, J satisfy M? Then, in terms of query answering, one would be interested to know whether J can be used in order to evaluate queries over T directly over it. Interestingly, the solution that has been proposed for the data exchange problem in [9] employs dependencies as a mapping language, while reformulation algorithms have been proposed for mappings expressed as a containment between views. In this paper we compare these approaches, as ways to solve the problem of answering queries across mappings, and identify their similarities and differences. Moreover, we present some of the solutions that have been proposed for the two problems and analyze the techniques that they employ. In particular, in the case of query reformulation for data integration, a fundamental technique involves the use of inverse rules [1, 8] to produce a logic/datalog program that can be composed with the original query to produce a reformulation over the source schemas. On the other hand, data exchange employs the chase [2], a technique that has also been used for query reformulation in the context of query optimization under materialized views and integrity constraints. However, in the case of data exchange it is applied on database instances rather than on queries. Although the two techniques have been applied to solve different problems, it turns out that there is a tight relationship between them, and, in fact, the inverse rules technique can be extended to compute solutions to the data exchange problem. The rest of this paper is organized as follows: in Section 2 we present some formal definitions related to the problems presented above and illustrate the different mapping languages, as well as specify precisely the set of “relevant” answers in a data integration setting. In Section 3 we analyze the problem of query reformulation and illustrate techniques that have been proposed to solve it, both in the case of query optimization and data integration, as well as generalizations of the problem in the context of P2P data management. In Section 4 we present in more detail the problem of data exchange and a solution that has been proposed, that employs the chase technique and illustrate how data exchange solutions can be used for query answering. Finally, in Section 5 we attempt a comparison between query reformulation and data exchange as solutions to the problem of answering queries across mappings, in terms of the cases in which they are applicable and the results that they produce. Moreover, we compare the underlying techniques that they employ, namely the inverse rules and the chase, and identify a tight relationship between them, as a result of 3

which the inverse rules technique can be used to produce solutions to the data exchange problem.

2

Preliminaries and definitions

Before proceeding to our discussion about query reformulation, data exchange and related techniques, we need to define the problem setting more precisely. Consider, as above, a source schema S and a target schema T . For simplicity of the presentation, and wlog, assume that S ∩ T = ∅. Moreover, consider the joint schema S ∪ T : its instances are pairs of instances hI, Ji s.t. I is an instance of S and J is an instance of T . In this context we can view a mapping M as a (conjunction of) logical assertion(s) over the joint schema S ∪ T . We are going to write hI, Ji |= M if the pair of instances I, J satisfies the mapping M.

2.1

Mapping Languages

As we mentioned above, different languages have been used in the literature to express mappings between schemas. One approach uses logical assertions, in the form of tuple generating dependencies or tgds: ∀x Q1 (x) → ∃y Q2 (x, y) and equality generating dependencies or egds: ∀x Q1 (x) → xi = xj where x, y are vectors of variables, Q1 (x), Q2 (x, y) are conjunctions of relational atoms1 whose free variables are among those in x and x ∪ y, respectively, and xi , xj are two of the variables in x.2 In the case of the data exchange problem, the authors consider only:3 • source-to-target tgds, where Q1 and Q2 are conjunctions of atoms from S and T , respectively • target tgds, in which both Q1 and Q2 only contain relational atoms from T . 1 In this paper we are going to focus on the case that both queries and mappings are (unions of) conjunctive queries, because this is essentially the only setting in which all approaches for query answering presented here are tractable [1]. Decidability and complexity results for more powerful mapping and query languages can also be found in the related literature but this discussion is outside the scope of this paper. 2 Note that the first expression above does not imply that all free variables of Q1 , i.e., x are free variables of Q2 as well, but some of them may be. Moreover, some of y may be empty, i.e., there may be no existentially quantified variables, in which case the dependency is called full 3 An important class of mappings that is not allowed involves target-to-source dependencies. Such mappings are necessary in order to extend the data exchange problem to a P2P setting [12], but they make the problem considerably harder.

4

• target egds, in which both Q1 only contains relational atoms from T . The other approach uses containment of conjunctive queries as a mapping language. Before we proceed to examples of such mappings, we need to formalize the notions of containment and equivalence, as introduced and studied in [4]. Definition 2.1 (Containment and equivalence). A query Q1 is contained in a query Q2 , denoted by Q1 v Q2 if for all database instances D, the set of tuples returned by evaluating Q1 over D, denoted Q1 (D), is a subset of Q2 (D). Two queries Q1 and Q2 are equivalent, iff Q1 v Q2 and Q2 v Q1 . The general form of mappings of this kind is: M : Q1 (x, y) v Q2 (x, z) where Q1 and Q2 are conjunctive queries. Papers that use this kind of mappings also restrict Q1 and Q2 to contain relational atoms from S and T , respectively, or vice versa4 . Suppose Q1 , Q2 are queries over S and T , respectively. Applying the definition of containment above, hI, Ji |= M if QS (I) ⊆ QT (J). In the related literature, this general form of mappings (where both QS and QT are conjunctive queries) is called GLAV [11]. Earlier, more specific cases of interest have been identified, in which either QS or QT was restricted to be a single relational atom (instead of a conjunction of atoms), and the mapping could be thus considered as a view over the other schema. The former case, (i.e., QS being a single atom) is called LAV , which stands for local-as-view, while the latter is called GAV [13], for globalas-view. Note that, in general, GAV, LAV and GLAV mappings are considered in a mediated setting, where several source schemas are mapped to a target schema. However, the general definition introduced above can accomodate this extension, by considering S to be the union of the source schemas. For this reason, in order to simplify our presentation, we are going to focus on the case where there is a single source (S) and a single target (T ) schema. Containment mappings can be further characterized w.r.t. the direction of the containment [17]: • If both QS v QT and QT v QS , one can write Q1 ≡ Q2 and the mapping is called exact. • If QS v QT , the mapping is called sound • If QT v QS , the mapping is called complete. Sound mappings have also appeared in the literature as mappings under the open world assumption (OWA), while exact mappings correspond to the closed world assumption (CWA) [1]. In practice, most papers that use GAV mappings consider them to be exact (CWA) or sound (OWA), while LAV and GLAV mappings are usually sound (OWA), unless otherwise 4

further restrictions apply in the case of GAV and LAV

5

specified. Moreover, in most papers, GAV mappings are specified using datalog rules with the target relations on the head (i.e.: T (x) : −QS (x, y)), and as a result z above is empty, since all variables in the head of a datalog rule must appear in the body. It is straightforward to observe that the general form of containment mappings corresponds to the general form of tgds presented above. As an example, QS (x, y) v QT (x, z) is a sound GLAV mapping that is equivalent to: ∀x, y QS (x, y) → ∃z QT (x, z). Similarly, the exact GAV mapping QS (x, y) ≡ T (x), where T is a relation in T , is equivalent to: ∀x, y QS (x, y) → ∃T (x)∧∀x T (x) → ∃y QS (x, y). Observe that, the source-to-target part of a (sound or exact) GAV mapping is a full tgd. Example 2.2. Consider an application where a source has information about researchers and the projects on which they work (S = {ProjMember(name, project)}) while the target (mediated) schema has pairs of researchers that work on the same project (T = {SameProject(name1,name2,project)}). Then, the exact GAV mapping: SameProject(x,y,p) :- ProjMember(x,p), ProjMember(y,p) expresses the mapping that all and only pairs of researchers whose names appear next to the same project in the source relation ProjMember, will also appear in the same tuple of SameProject. Similarly, LAV mappings are expressed using a datalog-like syntax, with a single source relation in the head (and all variables in the head appearing in the body); In the case of sound LAV, the symbol ⊆ is used instead of : − between the head from the body of the rule. Example 2.3. Suppose the mediated schema contains a relation with authors and their papers (S = {Author(name, paper)}) and some source only contains only some (but not necessarily all) of the pairs of researchers who have coauthored a paper (T = {CoAuthor(name1, name2)}). This can be expressed by the sound LAV mapping: CoAuthor(x,y) ⊆ Author(x,p), Author(y,p)

Finally, sound GLAV mappings also use a similar syntax to LAV but the head of the rule is a conjunction of atoms and it can also contain existential variables. In the rest of this paper, we are going to focus on query reformulation algorithms for sound LAV/GLAV and sound/exact GAV mappings, which are the most commonly used in practice, and compare query reformulation algorithms for such settings with the data exchange approach. Observe that, since the direction of the implication of the corresponding tgd is determined by the direction of the containment, data exchange can only be applied directly in the case of sound mappings, since it does not allow target-to-source dependencies (which correspond to complete and - one direction of - exact mappings). However, we are going to show in Section 5 that data exchange is also possible in the case of exact GAV mappings.

6

2.2

Certain answers

We still have not defined what it is we expect to get as “relevant” answers by evaluating queries over T across the mapping M. The answer would be simple if for every I there was a single J s.t. hI, Ji |= M, namely Q(J). However, this is not the case, and in general there are many Js, instances of T , s.t. hI, Ji |= M. This situation is reminiscent of the problem of querying incomplete databases [23, 16], as it was first observed in [1] for a particular case of mappings. For this reason, the authors of [1] suggested the set of certain answers, which has been adopted by most researchers5 as the desired set of answers to the problem of answering queries across mappings. Definition 2.4 (Certain answers). A tuple t is a certain answer of QT across M if for every J such that hI, Ji |= M, t ∈ QT (J). As a result, for every source instance I, the set of all certain answers of a query Q across a mapping M: ∆

certainM,I (QT ) =

\

QT (J)

J:hI,Ji|=M

3

Query reformulation

The approach of query reformulation as a solution for the problem of query answering across mappings has arised in two different contexts, namely query optimization under materialized views and integrity constraints - and the related problem of providing physical data independence - and data integration, usually under the name of answering queries using views. In the case of query optimization and physical data independence, the input is a query over the database schema and the goal is to rewrite this query into one that uses the materialized views. Query optimization usually requires the rewriting to use some of the views, if this will produce a more efficient query, while physical data independence requires that only materialized views appear in the reformulation. In both cases, the mappings are considered to be exact, since they are views that have been computed from the underlying data. More importantly for our discussion, in the case of query optimization we are only interested in equivalent reformulations. More precisely: Definition 3.1 (Equivalent reformulation). Let QT be a query over the target schema, and M be a mapping on S ∪T , then Q0S (i.e., over schema S) is an equivalent reformulation of Q across M if QT ≡ Q0S As a result of this definition, for every pair of instances hI, Ji of S, T , respectively, s.t., hI, Ji |= M, Q0S (I) = QT (J). One way to compute such equivalent reformulations across mappings - in the form of materialized views and integrity constraints, expressed as logical assertions - has been proposed 5

Although [3] have recently proposed an alternative set of desired answers, for the case of P2P data integration, as explained in Section 3.2

7

in [6, 20], by using the chase technique. In short, this technique works as follows: Let ∀x A(x) → ∃y B(x, y) be such an assertion, where A and B are conjunctions of relational atoms, and Q be the expression in the body of the conjunctive query (usually represented by a tableau [2]). The chase is applicable if there exists a homomorphism h from A to Q, that is the identity on constants and cannot be extended to A ∪ B. Let B[x := h(x)] denote the substitution of variables in the atoms of B by their image under h. Then, the result of the chase step is to add the conjuncts of B[x := h(x)] (as a conjunction) to Q. The chase terminates when no chase step with one of the logical assertions is applicable, in which case the query that has been produced is called a universal query, and is equivalent to the original query. Note that the chase may not terminate in some cases, or it may fail, if after some step the query produced contains an equality between distinct constants6 . In the case of query optimization, the goal is to find the most efficient subquery of the universal query that is equivalent to it. Clearly, there is always a solution to this problem (that may not consist only of view relations), namely the original query or the universal query. However, if only the materialized views are to be used, one has to drop all atoms of the universal query that are not view relations. Then, there is no guarantee that the resulting query will be an equivalent reformulation. Indeed, there are cases in which no equivalent reformulation that uses only view atoms exists, even if the mapping is exact. Suppose, for example, that T = {T1 (a, b), T2 (b, c)} and S = {S(d, e)}, the mapping is ∀xT1 (x, x)↔S(x) and the query is QT (x, y) : − T1 (x, y). Any query over S will have access to data that come from tuples of T1 that have the same value for both attributes, while T1 may also contain tuples with different values. In the case of physical data independence, one would not like to compromise completeness of query answering, and would return a negative answer, in case no equivalent reformulation exists. On the other hand, in data integration, one does not have the alternative of evaluating the original query, since the schema over which the query is issued is virtual. Moreover, sources are often only guaranteed to be sound (i.e., data sources can be incomplete), and this introduces more cases when no equivalent reformulation exists. In such cases, it is still desirable to produce a reformulation that is contained in the original query. Definition 3.2 (Contained Reformulation). Let QT be a query over the target schema, and M be a mapping on S ∪ T , then Q0S (i.e., over schema S) is a contained reformulation of Q across M if Q0S v QT In terms of pairs of instances, this means that for every I, J such that hI, Ji |= M, Q0S (I) ⊆ QT (J). In the sequel we are going to write M |= Q1 v Q2 to denote that Q1 is a contained reformulation of Q2 across M. Since all these contained reformulations return certain answers, it has been proposed that query reformulation should return the maximally-contained reformulation, i.e., a contained reformulation Q0S as defined above, with the additional property that for every other contained reformulation Q00S expressed in the same query language as Q0S , Q00S v Q0S . In other words, for every I, J s.t. hI, Ji |= M, ax is such that for every other contained reforthe maximally-contained reformulation QM S 6

but this is only possible when chasing with equality generating dependencies

8

ax (I) and, of course, QM (I) ⊆ Q (J). Consider now the query: mulation Q0S , Q0S (I) ⊆ QM T S S ∆

ref ormM (QT ) =

[

Q0S

Q0S :M|=Q0S vQT

i.e., the union of all contained reformulations. Clearly, ref ormM (Q) is a reformulation of Q. Moreover, since it is a union of queries, each tuple it returns when applied on an instance I is in the result of at least one of those queries. As a result, ref ormM (Q) is also a contained reformulation. Conversely, every contained reformulation is also contained in ref ormM (Q), by its definition as the union of all contained reformulations. Thus, ref ormM (Q) is in fact a maximally-contained reformulation7 . It was first observed in [1] for a particular case of LAV mappings, where the mapping language was containment of conjunctive queries and the query language was datalog, that maximally-contained reformulations compute certain answers. It turns out, as it was also observed in [9] and more precisely in [22], that this is a more general result, i.e., regardless of the expressiveness of the mapping language. Consider the following query, that returns all certain answers of a query Q over T across M: ∆

certainM (Q) = λI.certainM,I (Q) Proposition 3.3 ([22]). certainM (Q) ≡ ref ormM (Q) Proof. (certainM (Q) v ref ormM (Q)): For every t ∈ certainM (Q, I), by the definition of certain answers, for every J : hI, Ji |= M, t ∈ Q(J). As a result, certainM (Q) is a contained reformulation, i.e.: M |= certainM (Q) v Q. Since ref ormM (Q) is a maximally-contained reformulation of Q across M, it follows that certainM (Q) v ref ormM (Q). (certainM (Q) w ref ormM (Q)): It suffices to show that for every contained reformulation Q0 , Q0 v certainM (Q). Indeed, if Q0 is a contained reformulation, by definition, Q0 (I) ⊆ Q(J), for every I, J s.t. hI, Ji |= M. In other T words, for a given I, for every J s.t. hI, Ji |= M, Q0 (I) ⊆ Q(J), and as a result, Q0 (I) ⊆ J:hI,Ji|=M Q(J) = certainM (Q, I). Since ref ormM (Q) is itself a contained reformulation, it follows that certainM (Q) w ref ormM (Q).

3.1

Reformulation algorithms for data integration

We already presented a technique for producing equivalent reformulations for query optimization. Unfortunately, when no equivalent reformulations exist (as in data integration) this method does not produce maximally-contained reformulations. Instead, if it terminates, 7 Note that this is in fact the basis of the bucket algorithm [18]: First it identifies potentially contained reformulations, by combining “relevant” atoms, then it checks which ones are indeed contained and finally it takes their union

9

it computes minimally-containing [5] ones, by dropping all atoms that are not view relations from the universal query. For this reason, several query reformulation algorithms that produce maximally-contained reformulations have been proposed in the context of data integration. As mentioned above, we are only going to focus on sound/exact GAV and sound LAV/GLAV mappings, since these are the most common cases that arise in data integration systems. The problem is easier in the case of GAV mappings: for every relation T in T there is one or more mappings of the form T v QS . This can also be viewed as a datalog rule: T (x) : − QS (x, y), where y are existential variables and QS is a conjunctive query. Essentially, these rules with T in their head completely determine the contents of T w.r.t. the instance I of S. Then, for query reformulation we only have to replace every relational atom in the query with its “definition” in terms of S.8 Query reformulation is harder in the case of (sound) LAV mappings (perhaps surprisingly, it is no harder in GLAV than in LAV, as we are going to show later). Intuitively, the problem is that the atoms that occur in the query appear in the body of the mappings, viewed as datalog rules, instead of the head, within a conjunction of atoms that also contains existentially quantified variables. As a result, these rules cannot be used directly as in the case of GAV mappings. For this reason, [1, 8] proposed the method of inverse rules: • For every rule r, for every conjunct of the body of r, a new rule r0 is created, whose head is this conjunct and whose body is the head of r. • Since the conjuncts of a rule may contain existentially quantified variables, which cannot appear in the head of a rule, we have to skolemize the rules before inverting them as above. To do this, we replace all existential variables in the body of the rule with function symbols (we need a distinct function symbol for every rule and every existential variable of a rule), whose parameters are the variables that appear in the head of the rule. More precisely, for every rule r of the form: S(x) : − Q(x, y) where y stands for y1 , . . . , yn and Q is of the form T1 (x, y), . . . , Tm (x, y), let fr,yi be a skolem function. Then, the inverse rules program of r is a set of rules, containing on rule for each j ∈ [1 . . . m], of the form: Tj (x, fr,y1 (x), . . . , fr,yn (x)) : − S(x) Given a set of LAV mappings (views) V, we denote by V −1 the set of inverse rules of V. Let (Q, V −1 ) denote the logic program composed by the conjunctive query Q (expressed as a datalog program) and the rules in V −1 . Since the inverse rules have skolem functions in 8 If the mapping is sound, this definition may be a union of conjunctive queries, in which case the substitution step above produces a union of conjunctive queries.

10

their head, it is conceivable that the result of evaluating (Q, V −1 ) over an instance I would produce some tuples with function symbols. Let (Q, V −1 )↓ be a modified logic program, that evaluates (Q, V −1 ) and then discards all tuples with function symbols. Then, the authors of [8] proved the following: Theorem 3.4 ([8]). For every datalog program Q and every set of conjunctive views V, (Q, V −1 )↓ is a maximally-contained reformulation of Q (for the language of logic programs). Other algorithms have also been proposed for the case of LAV mappings. The simplest one, called the bucket algorithm ([18]), works as follows: (a) First, it identifies possible contained reformulations, by taking all combinations of the heads of rules in which atoms from the query appear, (b) Then, it checks which of the queries produced are indeed contained reformulations, and, (c) Finally, it takes the union of all those contained reformulations. Minicon ([21]) is an improvement of the bucket algorithm, that identifies some additional conditions for phase (a), which have the effect that all reformulations that are produced are contained, and no containment check is necessary. However, the inverse rules technique is the most general, in that it can also be extended to handle GLAV mappings, as explained in [11], as well as LAV mappings extended to allow dependencies on the target schema (as the ones introduced in the problem of data exchange, presented below). In fact, there is a tight relationship between the inverse rules method and the chase, as used in the context of data exchange, and as a result it can also be extended to compute solutions to that problem, as we are going to illustrate in Section 5.1.

3.2

Extensions to P2P data management

Before switching to the presentation of the data exchange problem, it is worth mentioning the generalization of the query reformulation problem in the context of P2P systems. Indeed, until now, we have considered a mediated setting in which there is one source (usually formed by the union of several sources) and one target schema. However, there are applications that require a more flexible architecture, in which every peer has a schema, related through some mapping to one or more of the other peer schemas. The important difference in this case is that: a) there is no single target (mediated) schema over which all queries are issued; in fact, there are several virtual (peer) schemas, and the same peer schema can be viewed as both the source of some mappings and the target of others and, b) in general, there are no direct mappings between every pair of source and (possible) target schemas. However, there may be indirect mappings between some pairs, formed by “paths” of mappings. As a result, in order to reformulate a query over a peer schema to queries over schemas of actual data sources, one may need to produce intermediate reformulations over schemas along the path of mappings from the target to the source. 11

A framework for the definition of such systems, called Peer Data Management Systems (PDMS), is introduced in Piazza ([14]), using a language called PPL. In short, the problem setting is the following: • There is a set of source schemas, S1 , . . . , Sn and a set of instances of those schemas I1 , . . . , In , respectively. • There is a set of peer schemas, P1 , . . . , Pm . • For every source schema Si there is a mapping Mi that relates it with a (single) peer schema Pj . These mappings are (sound or exact) LAV mappings (i.e., S v QPj ). • There is a set of peer mappings MP , whose members are mappings between pairs of peer schemas Pi , Pj . These mappings are either exact (expressed using datalog syntax, similar to GAV mappings), sound (expressed using set inclusion syntax, similar to LAV mappings) or definitional mappings. The last are a simple form of datalog rules, whose head and body are both peer relations, i.e., not queries, and are used to allow a limited form of equality, that does not complicate query answering. Let M = MP ∧

V

i Mi ,

then J1 , . . . , Jm are solutions for the peers 1, . . . , m if: hI1 , . . . , In , J1 , . . . , Jm i |= M

Then, the certain answers of a query Qk (over schema Pk ) are: ∆

\

certainM,I1 ,...,In (Qk ) =

Qk (Jkl )

Jkl :hI1 ,...,In ,J1 ,...,Jkl ,...,Jm i|=M

One can visualize this setting by means of a mapping graph. The nodes of this graph are peer relations. Then, there is a (directed) edge from node R1 to node R2 if there is a mapping that corresponds to a dependency ∀x Q1 (x) → ∃y Q2 (x, y), where R1 appears in Q1 and R2 appears in Q2 . Example 3.5. Consider the following source and peer schemas: • S1 = { S1(name,project,area)} • S2 = { S2(name1,name2)} • P1 = { ProjMember(name,project),Area(project,area)} • P2 = { SameProject(name1,name2,project),Author(name,paper)} and the mappings: • r0 : SameProject(n1,n2,p) : − ProjMember(n1,p), ProjMember(n2,p) 12

P1

P2 ProjMember

SameProject

Area

Author

S1

S2

S1

S2

Figure 1: The mapping graph of a simple PDMS • r1 : S1(n,p,a) ⊆ ProjMember(n1,p), Area(p,a) • r2 : S2(n1,n2) ⊆ Author(n1,p), Author(n2,p) The mapping graph for this example is shown in Figure 1. Observe that the mapping graph may, in general, have cycles. As a simple example, the two dependencies that correspond to an exact (equality) mapping introduce a cycle. Unfortunately, the existence of cycles makes query answering undecidable: Theorem 3.6 ([14]). 1. The problem of finding all certain answers to a conjunctive query Q, for a PDMS N specified in PPL is undecidable 2. If the graph formed by the peer mappings is acyclic, a conjunctive query can be answered in polynomial time. 3. The problem is still in polynomial time if: (a) equality mappings do not contain projections and (b) a peer relation that appears on the head of a definitional mapping does not appear on the right-hand side (in the datalog-like syntax) of any other mapping. As a result of the theorem above, one cannot hope to find a complete query reformulation algorithm that will work for any PDMS. However, such a complete algorithm can be found for the cases identified in parts 2,3. For this reason, the authors of [14] propose a sound (but not complete) query reformulation algorithm, that computes all certain answers, when the mapping graph is acyclic, or only contains cycles of the form as described in cases 2,3 above, while it computes a subset of certain answers, if cycles exist. The following is a simplified description of the algorithm: Piazza algorithm

13

• The reformulation of a query that is a conjunction of atoms is the conjunction (join) of the reformulations of its conjuncts. • For every conjunct R(x) (also called a subgoal of the input query Q), identify all mappings that are adjacent to the relation R in the mapping graph. The reformulation of this conjunct is equal to the union of all reformulations that are produced by these mappings. • For every mapping r adjacent to R, there are two cases: 1. If r is a GAV-like mapping (datalog syntax), it is only usable for reformulation if R is the single atom in the head of r. In this case, reformulation proceeds as in the case of GAV reformulation: the R atom in the query is replaced by the right-hand side (rhs) of R. Since these atoms also contain variables, one needs to find a homomorphism between the variables in the head of the rule and in the query atom before this substitution is possible. Moreover, since the rhs of r may introduce some new variables (the existentially quantified ones), one should make sure that they are substituted by fresh ones (i.e., that have not appeared in the reformulation up to this step). 2. If r is a (sound) LAV-like mapping (using set inclusion syntax), it is only usable if the relation symbol of the query atom appears in the rhs of r. This step is treated as a LAV reformulation step, e.g., by computing the inverse rules of r and unifying them with the query atom under consideration, i.e., finding a homomorphism from the variables of the head of the relevant inverse rules (that have R on their head) to the variables in the query atom. Then one can substitute the query atom with the relevant inverse rule(s), in which the variables have been substituted by their image under this homomorphism. Note that, even though the inverse rules can have skolem terms as variables of the atom in their head, this homomorphism will map them to some of the variables already in the query atom, and thus no skolem terms appear in the reformulation. If more than one inverse rule is applicable, the result is the union of the reformulations according to each of the rules. • It should be clear that the procedure explained above produces a tree, whose nodes are relational atoms, and whose edges correspond to reformulation steps. This tree is called a rule-goal tree. The procedure above is repeated recursively on the new nodes that are produced at every step until all leaf nodes contain source relation symbols. This does not compromise completeness, since every source relation is mapped to a single peer schema, and thus there are no “outgoing” mappings from source schemas that could produce further reformulations. • In order to guarantee termination, the algorithm disregards mappings that have already been used at earlier steps of the reformulation (along the path from the root of the rule-goal tree to the currently considered node), when considering adjacent mappings. This guarantees termination of the algorithm, even in the presence of cycles in the mapping graph, while it does not compromise its completeness, when the map-

14

Q(n1,n2)

q

r0

ProjMember(n1,p)

ir1

S1(n1,p,_)

ProjMember(n1,p)

Author(n2,w)

Author(n1,w)

SameProject(n1,n2,p)

ir2a

ir2b

ir2a

ir2b

S2(n1,n2)

S2(n2,n1)

S2(n2,n1)

S2(n1,n2)

ir1

S1(n2,p,_)

Figure 2: Reformulation rule-goal tree of a query over the example PDMS of Figure 1 pings are acyclic. However, it is the source of the incompleteness of the algorithm, in the presence of cycles in the mapping graph. Example 3.7. Figure 2 shows the rule-goal tree that is produced for the reformulation of the query Q(n1, n2) : −SameP roject(n1, n2, p), Author(n1, w), Author(n2, w) (over schema P2 ). To apply the algorithm above for the case of the LAV mappings r1 , r2 we need to define their inverse rules: • ir1 : ProjMember(n1,p) : − S1(n,p,a) • ir2 a : Author(n1,f (n1,n2) : − S2(n1,n2) • ir2 b : Author(n2,f (n1,n2) : − S2(n1,n2) An arc between the children of a node denotes a conjunction (join), while a lack of an arc denotes a union. The reformulation of Q is then: Q0 (n1, n2) ≡ S1(n1, p, )∧S1(n2, p, )∧(S2(n1, n2) ∪ S2(n2, n1))∧(S2(n2, n1) ∪ S2(n1, n2)) ≡ (S1(n1, p, )∧S1(n2, p, )∧S2(n1, n2)) ∪ (S1(n1, p, )∧S1(n2, p, )∧S2(n2, n1)) An alternative approach to deal with undecidability in the presence of cycles is proposed in [3]. In that paper, the authors propose a different interpretation of the mappings, based on epistemic logic, as opposed to the first-order interpretation that we have considered thus far. The essential difference, in terms of query answering, is that the set of certain answers under the epistemic interpretation, denoted as certainK M,I , (where M is the set of all mappings and I is an instance of the joint schema of all data sources) is a subset of the set certainM,I of certain answers, as defined above. On the other hand, computing this set of answers is decidable, even in the presence of cycles in the mapping graph, and the authors propose a query answering algorithm, that computes the set of certain answers 15

under the epistemic interpretation. However, it is not clear how certainK M,I relates to the set of answers computed by the Piazza algorithm in the case of mapping graphs with cycles, or whether there is any intuitive reason why one would prefer certainK M,I over certainM,I , as the set of “relevant” answers in a PDMS. A completely different approach proposes to compute direct mappings between every pair of peer schemas by composing the mappings [19, 10] along a path in the mapping graph. A more detailed analysis is outside the scope of this paper.

4

Data Exchange

Data exchange [9] proposes an alternative approach to query reformulation, for answering queries across mappings. Instead of computing a reformulation for each query over the target schema, so that it can be evaluated on the real data sources, data exchange proposes to materialize an instance of the target schema, so that all such queries can be evaluated directly on it. Observe that data exchange is not directly comparable to query reformulation or query answering as a problem, since its purpose is to compute a materialized instance of the target schema, that, together with the source instance, satisfies the mapping. However, it can be considered as solution to the query answering problem, since one can use a solution of the data exchange problem in order to evaluate queries over the target schema, as it was shown in [9]. As in the case of query reformulation, the set of answers that we expect to get for every query QT , given a mapping M and an instance I of T , is certainM,I (QT ), the certain answers. Papers on data exchange consider only tgds from S to T , which in general correspond to sound GLAV mappings. Moreover, the data exchange setting allows to specify target dependencies (i.e. tgds and egds from T to T ). For reasons of simplicity, we are going to limit our presentation on data exchange settings with source to target dependencies only. However, most of the results presented below for data exchange and its comparison with query reformulation extend to the general case. The main reason for this is that, as it was shown in [7], the inverse rules technique can be extended to handle target dependencies. In general, given a mapping M and a source instance I there may be several possible solutions to the data exchange problem, i.e., instances J of T s.t., hI, Ji |= M. Consider for example the following situation: Example 4.1. Let S = {S1 (a, b), S2 (c, d)}, T = {T (e, f)}, M = {d1 : ∀x, y S1 (x, y) → ∃z T (x, z), d2 : ∀x, y S2 (x, y) → ∃z T (z, y)} and I = {S1 (a0 , b1 ), S2 (a1 , b0 )} Observe that the tgds in M do not completely specify the target instance. For example, d1 requires that for every constant value that appears as value of attribute a of a tuple t ∈ S1 , there is a tuple t0 ∈ T with the same value for attribute e. However, it does not specify the value of attribute f of t0 , or whether there may be multiple tuples in T with the same value for attribute e and multiple different values for attribute f. In order to express this incomplete information, one can use special values, called labeled nulls, which are essentially variables (as opposed to “normal” values, i.e., constants). For example, one target instance for the 16

setting presented above would be: J = {T (a0 , Z0 ), T (Z1 , b0 )}, where Z0 , Z1 are labeled nulls. Other solutions would be J 0 = {T (a0 , b0 )} and J 00 = {T (a0 , Z0 ), T (Z1 , b0 ), T (Z2 , b0 )}.9 Among the possible solutions, the authors of [9] identified a special class of solutions, called universal, that turn out to be appropriate for query answering. Before we introduce the definition of universal solution, we need to clarify the notion of a homomorphism between joint “instance” that can contain labeled nulls: Definition 4.2 (Homomorphism). Let R be a schema, C be a finite set of constants (the domain of R) and V be an infinite set of variables (i.e., the labeled nulls). Moreover, for an instance K of R, let V(K) be the finite subset of V, whose contents are the variables that appear in K. Let K1 and K2 be two instances over R, with values in C ∪ V. 1. ([9]) A homomorphism h : K1 → K2 is a mapping from C ∪ V(K1 ) to C ∪ V(K2 ) that is the identity on constants and for every tuple t and every relation Ri ∈ R s.t. Ri (t) ∈ K1 , Ri (h(t)) ∈ K2 (where if t = (t1 , . . . , tn ), then h(t) = (h(t1 ), . . . , h(tn ))). 2. ([9]) K1 is homomorphically equivalent to K2 if there is a homomorphism h : K1 → K2 and h0 : K2 → K1 3. K1 is isomorphic to K2 if there is a bijection h that is a homomorphism from K1 to K2 and its inverse h−1 is a homomorphism from K2 → K1 . Definition 4.3 (Universal solution). If M is a mapping and I is a source instance then a universal solution for I is an instance J s.t. for every solution J 0 there is a homomorphism h : J → J 0. Example 4.4. To return to our example above, the solution J turns out to be universal, while J 0 and J 00 are not. For example, the homomorphism h : J → J 0 is the identity in constants plus {h(Z0 ) = b0 , h(Z1 ) = a0 }. Notice that h is also a homomorphism J → J 00 (although it does not map any tuple to T (Z2 , b0 ), since it does not have to be surjective). Universal solutions turn out to have several useful properties, such as that all universal solutions for a source instance I are homomorphically equivalent. Moreover, if the universal solutions J, J 0 of two source instances I, I 0 are homomorphically equivalent, the set of all solutions for I and I 0 are equal. Last, but not least, a universal solution can be computed by applying the chase on the joint instance hI, Ji, instead of on a query, as in the case of equivalent reformulations for query optimization. More precisely, applying the chase on a (joint) instance K proceeds as follows: • if d : ∀x φ(x) → ∃y ψ(x, y) is a tgd, the chase with d is applicable if there exists a homomorphism h from φ(x) to K that cannot be extended to a homomorphism h0 from φ(x)∧ψ(x, y) to K 9

Observe that the target instance may only contain constants that appear either in I or in the dependencies of M (for reasons of simplicity of presentation we have not shown any such tgds but the results extend to them, too), as well as labeled nulls. Since there is an infinite number of labeled nulls, there can be an infinite number of solutions (and this is indeed usually the case).

17

• If the chase with d is applicable, let K 0 be the union of K with the facts produced by: (a) extending h to h0 such that each variable in y is assigned a fresh labeled null (different for each yi ), and (b) taking the image of the atoms of ψ under h0 . The algorithm that computes the universal solution starts with a joint instance hI, ∅i, where I is a source instance and ∅ is a target instance, and applies the chase to it. In the case that we are considering here, where the mapping consists only of source-to-target dependencies, applying the chase can only introduce relational atoms from T . As a result, for every sequence of chase steps, the applicability of a step does not depend on any of the previous steps, since the left-hand side of all tgds consists only of relational atoms from S and no such atoms are introduced by the chase. Thus, all chase steps that can be performed along any chase sequence are applicable on the original joint instance hI, ∅i. Moreover, since there are no egds and no constants in the tgds, the chase cannot fail. As a result, when considering only source-to-target dependencies, the chase always terminates and computes a universal solution, which is also unique in this setting. In the general case, including target tgds and egds, the chase may also fail or diverge. As it was proved in [9], if the chase terminates, its result is a universal solution, while if there exists some failing finite chase starting with hI, ∅i, then there is no universal solution. Moreover, they identified a sufficient condition for the mapping to guarantee termination of the chase, namely that it is a union of a weakly acyclic10 set of tgds with a set of egds. Finally, observe that in the general case the chase is not deterministic, i.e., considering the dependencies in the mapping in different order may produce different terminating chase sequences. As a result, the universal solution that is computed by the chase is not unique. In terms of complexity of the computations, it was proved in [9] that the length of every chase sequence is polynomial in the size of the source instance I. As a result, computing a universal solution using the chase can be done in polynomial time; note that this is in the size of the instance I, which is usually large. As we have discussed earlier, the reason for materializing a target instance is in order to be able to answer queries over the target instance directly. It turns out that not all solutions J s.t. hI, Ji |= M are appropriate for this. Fortunately, as it was proved in [9], universal solutions can be used to compute the certain answers to a query over T ; moreover, they are the only ones for which this is true for every conjunctive query. To state the theorem, we first need some terminology, similar to one introduced in Section 3.1: Definition 4.5. If Q is a k-ary query and J is a target instance, let Q(J)↓ denote the set of all k-tuples t of constants (i.e., not including labeled nulls) such that t ∈ Q(J) Intuitively, Q(J)↓ is the result of evaluating the result of the query Q over J and discarding all tuples that contain labeled nulls. Then, the following theorem justifies the selection of universal solutions as appropriate for query answering: Theorem 4.6. [9] 10 this notion refers to a different kind of mapping graph and is not to be confused with acyclicity of mappings in the context of PDMSs. A more detailed analysis of weak acyclicity is outside the scope of this paper, but can be found in [5, 9]

18

1. Let Q be a union of conjunctive queries over target schema T . If I is a source instance and J is a universal solution, then certainM,I (Q) = Q(J)↓ 2. Let I be a source instance and J be a solution such that for every conjunctive query Q over T , we have certainM,I (Q) = Q(J)↓ . Then J is a universal solution.

5

Comparison

From a practical perspective, although query reformulation has been considered as the de facto solution for data integration until the last couple of years (and there are already several data integration systems that follow this approach), there are several situations in which one might prefer - or, in fact, need - data exchange, i.e., to materialize a target instance. For example, in query-intensive applications, such as data warehousing, it may be more efficient to compute a target instance and then evaluate queries locally. A similar situation arises in a setting where the computer that serves the target schema is not connected to the network or to the other data sources all the time, and we still want to be able to evaluate queries even when it is offline. Observe that data exchange also simplifies query evaluation, since one can use a traditional database system.11 On the other hand, query reformulation returns a distributed query plan (across several peers, in a P2P setting) and query evaluation may not be trivial or very efficient - although there is room for optimization using, e.g., adaptive query processing. On the other hand, data exchange involves a large “startup” cost, for materializing the target instance, which may not be worth paying if the volume of queries is relatively small. This cost may be even more significant if updates at the sources are common and the materialized instance needs to be recomputed often (although one could imagine some sort of incremental propagation of updates to the materializing instance). In general, query reformulation is more appropriate for integrating autonomous data sources, since queries are evaluated directly by the sources and there is no need to handle updates. Moreover, query reformulation has been applied to more general settings, as P2P data integration, while the first steps for the extension of data exchange to such settings have just been proposed in [12]. As we already mentioned, in terms of query answering, both query reformulation and data exchange compute the set of certain answers. In the following section we are going to show that there is a more fundamental relationship between the techniques employed by the two approaches, namely the inverse rules and the chase, and as a result one can in fact use the inverse rules technique to compute a universal solution. Moreover, we are going to illustrate how data exchange can be applied in common reformulation settings, namely sound/exact GAV and sound LAV/GLAV. 11

with slight extensions to handle labeled nulls

19

5.1

Using inverse rules for data exchange

A slight adjustment of the inverse rules algorithm can be used to compute universal solutions in a data exchange setting. For every relation Ti (xi ) of the target schema T consider the query: QT (xi ) : − Ti (xi ) Moreover, if I is an instance of S, for every relation S in S and every tuple t in S I create a rule (fact): S(t) : − For every relation Ti , let PM,I,Ti be the logic program formed by the union of inverse rules of the mapping with the facts and the corresponding query defined above. For reasons of simplicity of the presentation, and wlog, we are going to assume that T contains a single relation T , and we will write PM,I instead of PM,I,T . Then, the following proposition asserts that the result of the execution of PM,I , where tuples with function symbols are not discarded as before, is isomorphic with chaseM (I). solution. Proposition 5.1. The instances produced by the chase on hI, ∅i and the logic program PM,I described above are isomorphic. Proof. We are only going to prove this result for the case of source-to-target dependencies, although it is likely that the result also holds in the general case. As we mentioned above, in this case the chase is deterministic, i.e. it produces a unique universal solution, denoted chaseM (I). Since all steps of any chase sequence are applicable on the original instance hI, ∅i, these steps can be freely interchanged and still create a chase sequence that produces the same result. Observe that PM,I is also deterministic in the same way, as there are no rules for which a relation that appears in the body of a rule also appears in the head of another (since S ∩ T = ∅ ). d

Let C denote a chase sequence starting from hI, ∅i, formed by chase steps ci : hI, Jk i → hI, Jk+1 i. Also, let hci denote the homomorphism by which ci was S applied, and out(ci ) = Jk+1 − Jk (i.e., the facts introduced by ci ). Then chaseM (I) = i out(ci ). In the case of PM,I , consider its execution S, using SLD resolution [2]; since PM,I has the particular form identified above, all SLD derivations sj of facts that are produced by the program consist of a single unification step, denoted θsj , that matches the variables in the body of some inverse rules (that have the same body) to constants, that appear in facts, while it is a homomorphism w.r.t. the relations. Let out(sj ) denote the facts that are produced by this S derivation, and out(PM,I ) = j out(sj ), the facts produced by the execution of PM,I . Then, the result above follows from the following lemma: Lemma 5.2. There is a bijection f : C → S, s.t. f (ci ) = sj iff hci = θsj . Moreover, out(f (ci )) is isomorphic with out(ci ). Proof. Let d: ∀x QS (x) → ∃y QT (x, y) be the dependency that was used for ci . For the chase to be applicable, there must exist a homomorphism hci : QS (x) → I, that cannot be 20

extended to QS (x) ∪ QT (x, y). hci is the identity on constants (if any), and assigns values from the domain of I to the variables x, so that I |= QS (hci (x)), i.e., for every conjunct Sk (x) of QS there is a tuple h(x) ∈ SkI . By the construction of PM,I , for each such tuple there is a fact in PM,I : SkI (hci (x)) : − (1) When the chase is applied, the extension h0ci of hci maps every variable in y to a fresh labeled null, i.e., one that is unique for every dependency d, homomorphism hci and variable yl in y. Let Z(d, hci , yl ) denote each such labeled null, then, for every conjunct Ti (x, y) in QT (x, y), the facts out(ci ) are of the form: Ti (hci (x), Z(d, hci , y1 ), . . . , Z(d, hci , yn ))

(2)

Consider now the inverse rule that is produced from d; for every conjunct Ti (x, y) of QT , there is an inverse rule: Ti (x, fd,y1 (x), . . . , fd,yn (x)) : − QS (x)

(3)

It should be clear that if there exists a homomorphism hci : QS (x) → I as above, then hci is also a unification between the inverse rules of d and the facts identified in (1) above, and vice versa. Let sj be the SLD derivation that is produced by the unification hci , then there is clearly a bijection f s.t. f (ci ) = sj . Applying the unification to each rule of the form of (3), we derive facts out(sj ) of the form: Ti (h(x), fd,y1 (hci (x)), . . . , fd,yn (hci (x))(4) Clearly, there is a bijection g : out(ci ) → out(f (ci )), that is a homomorphism w.r.t. the relations of T s.t. g(Z(d, hci , y) = fd,y1 (hci (x))), i.e. out(ci ) is isomorphic to out(f (ci ))

By the lemma above, and by induction on the length of a chase sequence C that produces a universal solution, chaseM (I) is isomorphic to out(PM,I ).

Corollary 5.3. out(PM,I ) is a universal solution Proof. Let g be the isomorphism from out(PM,I ) to chaseM (I). Since chaseM (I) is a universal solution, for every other solution J there is a homomorphism hJ from chaseM (I) to J. Then, for every solution J, hJ ◦ g is a homomorphism from out(PM,I ) to J.

21

5.2

Using data exchange to simulate GAV, LAV, GLAV

In this section we are going to focus only on the most popular settings of answering queries using views, namely exact/sound GAV, sound LAV and sound GLAV. For the former, the mapping is specified by a union of rules (possibly with the same head) of the form: Tj (x) : − QSi (x, y), where Tj ∈ T (the mediated S schema) andSSi is the schema of source i. Assume wlog, that Si s are disjoint and let S = i Si and I = i Ii , where Ii is the instance of Si . Let VS denote the union of the rules (mappings) between all Si s and T . In terms of instances, such an exact GAV mapping M implies that hI, Ji |= M iff J = VS (I). Query reformulation is simple in this case: given QT its reformulation over source schemas is given by QT ◦ VS . Data exchange cannot be applied directly, because the mapping is equivalent to two tgds, one from S to T and another from T to S, and the latter cannot be captured in the data exchange framework. However, the target solution is completely specified by the mapping, i.e. there is a single solution, namely J = VS (I), which is trivially universal. Sound GAV is slightly more interesting: in this case, hI, Ji |= M iff VS (I) ⊆ J. Query reformulation again amounts to the composition QT ◦VS . Data exchange can now be applied directly, since the mapping corresponds to a source-to-target tgd. However, it is easy to see that VS (I) is again a universal solution, since the identity mapping is a homomorphism from it to every other solution. Observe that the view definition can be expressed as a set M of tgds (one for each of the rules) from S to T : ∀x, y QSi (x, y) → T (x). Then, chaseM (I) = VS (I). Query reformulation for sound LAV and GLAV is more involved. Since both can be solved using the inverse rules technique explained above, we are going to treat them uniformly. In both cases, the mapping is a set of containments of conjunctive queries: QSi (x) v QT (x, y) (in the case of LAV the head of these S rules is a single relational atom of Si ). Assume, wlog, that Si s are disjoint and let S = i Si . As we have already observed in Section 2.1, these views are equivalent to a set M of tgds from S some Si to T , and thus from S to T , of the form: ∀x QSi (x) → ∃ySQT (x, y). Let I =S i Ii , where Ii is the instance of data source i. Let J = chaseM (I) = i chaseMi (Ii ) = i Ji , where Mi is the set of tgds from Si to T . Since conjunctive queries are monotone, for every conjunctive query QT : [ [ QT (J)↓ = QT (Ji )↓ = certain(QT , Ii ) = certain(QT , I) i

i

By theorem 4.6-2, this holds only for universal solutions, i.e., J is a universal solution for the target schema T .

6

Conclusions

In this paper we compared two solutions to the problem of answering queries across mappings in the context of integration of heterogeneous information, namely query reformulation and data exchange. We illustrated that they both produce the same solutions to the problem of query answering, namely the certain answers. We also showed that the techniques that 22

are employed in both cases are closely related and outlined how query reformulation techniques could be used to solve the data exchange problem, as well as how the data exchange approach could be applied in common mediated GAV and LAV settings.

References [1] S. Abiteboul and O. Duschka. Complexity of answering queries using materialized views. In Proceedings of ACM Symposium on Principles of Database Systems, Seattle, WA, USA, 1998. [2] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. [3] D. Calvanese, G. D. Giacomo, M. Lenzerini, and R. Rosati. Logical foundations of peer-to-peer data integration. In Proceedings of ACM Symposium on Principles of Database Systems, pages 241–251, 2004. [4] A. K. Chandra and P. M. Merlin. Optimal implementation of conjunctive queries in relational data bases. In STOC ’77: Proceedings of the ninth annual ACM symposium on Theory of computing, pages 77–90, New York, NY, USA, 1977. ACM Press. [5] A. Deutsch, B. Lud¨ascher, and A. Nash. Rewriting queries using views with access patterns under integrity constraints. In Proceedings of International Conference on Database Theory (ICDT), pages 352–367, 2005. [6] A. Deutsch, L. Popa, and V. Tannen. Physical data independence, constraints, and optimization with universal plans. In Proceedings of International Conference on Very Large Databases (VLDB), pages 459–470, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. [7] O. Duschka. Query planning and optimization in information integration. Computer Science Technical Report, PhD Thesis STAN-CS-TR-97-1598846, Stanford University, 1997. [8] O. M. Duschka and M. R. Genesereth. Answering recursive queries using views. In Proceedings of ACM Symposium on Principles of Database Systems, pages 109–116, 1997. [9] R. Fagin, P. Kolaitis, R. Miller, and L. Popa. Data exchange: Semantics and query answering. Theoretical Computer Science, 2005. to appear. [10] R. Fagin, P. G. Kolaitis, L. Popa, and W. C. Tan. Composing schema mappings: Second-order dependencies to the rescue. In PODS, pages 83–94, 2004. [11] M. Friedman, A. Levy, and T. Millstein. Navigational plans for data integration. In AAAI ’99/IAAI ’99: Proceedings of the sixteenth national conference on Artificial intelligence and the eleventh Innovative applications of artificial intelligence conference innovative applications of artificial intelligence, pages 67–73, Menlo Park, CA, USA, 1999. American Association for Artificial Intelligence. 23

[12] A. Fuxman, P. G. Kolaitis, R. J. Miller, and W. C. Tan. Peer data exchange. In PODS, 2005. [13] H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, and J. Widom. The TSIMMIS approach to mediation: Data models and languages. Journal of Intelligent Information Systems, 8(2):117–132, 1997. [14] A. Halevy, Z. Ives, D. Suciu, and I. Tatarinov. Schema mediation for large-scale semantic data sharing. The VLDB Journal, 2005. to appear. [15] A. Y. Halevy. Answering queries using views: A survey. The VLDB Journal, 10(4):270– 294, 2001. [16] T. Imielinski and W. Lipski. Incomplete information in relational databases. Journal of the ACM, 31(4):761–791, 1984. [17] M. Lenzerini. Data integration: a theoretical perspective. In Proceedings of ACM Symposium on Principles of Database Systems, pages 233–246, New York, NY, USA, 2002. ACM Press. [18] A. Y. Levy, A. Rajaraman, and J. J. Ordille. Querying heterogeneous information sources using source descriptions. In VLDB ’96: Proceedings of the 22th International Conference on Very Large Data Bases, pages 251–262, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc. [19] J. Madhavan and A. Y. Halevy. Composing mappings among data sources. In VLDB, pages 572–583, 2003. [20] L. Popa, A. Deutsch, A. Sahuguet, and V. Tannen. A chase too far? In Proceedings of ACM SIGMOD Conf. on Management of Data, pages 273–284, New York, NY, USA, 2000. ACM Press. [21] R. Pottinger and A. Halevy. Minicon: A scalable algorithm for answering queries using views. The VLDB Journal, 10(2-3):182–198, 2001. [22] V. Tannen. Queries across mappings: reformulation or exchange? Manuscript, 2005.

Unpublished

[23] R. van der Meyden. Recursively indefinite databases. In Proceedings of International Conference on Database Theory (ICDT), pages 364–378, 1990.

24