On the First-order Expressibility of Computing Certain Answers to Conjunctive Queries over Uncertain Databases [Extended Abstract]

Jef Wijsen

Université de Mons (UMONS) 20 Place du Parc 7000 Mons, Belgium


ABSTRACT

A natural way for capturing uncertainty in the relational data model is by having relations that violate their primary key constraint, that is, relations in which distinct tuples agree on the primary key. A repair (or possible world) of a database is then obtained by selecting a maximal number of tuples without ever selecting two distinct tuples that have the same primary key value. For a Boolean query q, CERTAINTY(q) is the problem that takes as input a database db and asks whether q evaluates to true on every repair of db. We are interested in determining queries q for which CERTAINTY(q) is first-order expressible (and hence in the low complexity class AC0). For queries q in the class of conjunctive queries without self-join, we provide a necessary syntactic condition for first-order expressibility of CERTAINTY(q). For acyclic queries (in the sense of [4]), this necessary condition is also a sufficient condition. So we obtain a decision procedure for first-order expressibility of CERTAINTY(q) when q is acyclic and without self-join. We also show that if CERTAINTY(q) is first-order expressible, its first-order definition, commonly called (certain) first-order rewriting, can be constructed in a rather straightforward way.

Categories and Subject Descriptors H.2.3 [Database Management]: Languages—query languages; H.2.4 [Database Management]: Systems—relational databases

General Terms Theory, Algorithms

Keywords Conjunctive queries, consistent query answering, first-order expressibility, primary keys

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PODS’10, June 6–11, 2010, Indianapolis, Indiana, USA. Copyright 2010 ACM 978-1-4503-0033-9/10/06 ...$10.00.

1. INTRODUCTION

Uncertainty is a phenomenon that is inherent in many database applications. A natural way for representing uncertainty is by having relations that violate their primary key constraint. Such uncertainty is not necessarily a bad thing. In planning databases, for example, primary key violations can represent different alternatives. In the following conference planning database, where primary keys are underlined, the exact town of VLDB 2016 is still uncertain (it can be Mons or Gent). On the other hand, even though we do not know the exact town, we can say that VLDB will be held in Belgium.

R    Conf    Year    Town
     VLDB    2016    Mons
     VLDB    2016    Gent

T    Town     Country
     Mons     Belgium
     Gent     Belgium
     Liège    Belgium
     Liège    France

(Primary key of R: Conf, Year. Primary key of T: Town.)

Uncertainty also arises as an inconvenient but inescapable consequence of data integration and data exchange. The relation T in the preceding example combines data from two different sources, each providing a different country for the city of Liège. Existing chase-based data exchange frameworks [8] provide no solution in this case, because the chase will fail when it tries to identify France and Belgium.

Uncertainty by primary key violations gives rise to (exponentially many) “possible worlds,” which we will call repairs: every repair is obtained by selecting a maximal number of tuples from each relation without ever selecting two distinct tuples that agree on their primary key. A Boolean query is then certain if it evaluates to true on every repair. Our example database has four repairs, each satisfying the Boolean conjunctive query q1 = ∃x∃y(R(‘VLDB’, x, y) ∧ T(y, ‘Belgium’)), stating that VLDB will be organized in Belgium in some year.

In this article, we are interested in determining certainty of Boolean queries by means of a technique known as (certain) first-order rewriting. To see that q1 is true in every repair, there is actually no need to evaluate q1 on all repairs. It suffices to check that the following first-order sentence ϕ1 evaluates to true on the original database:

ϕ1 = ∃x∃y (R(‘VLDB’, x, y) ∧ ∀y (R(‘VLDB’, x, y) → (T(y, ‘Belgium’) ∧ ∀z (T(y, z) → z = ‘Belgium’))))
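The claim about the four repairs and the sentence ϕ1 can be checked mechanically on the example database. The following is only a minimal sketch: the helper names (`blocks`, `repairs`, `q1`, `phi1`) and the tuple encoding of R and T are illustrative assumptions, not part of the article.

```python
from itertools import product

# Assumed encoding: R(Conf, Year, Town) with primary key (Conf, Year),
# T(Town, Country) with primary key (Town).
R = [("VLDB", 2016, "Mons"), ("VLDB", 2016, "Gent")]
T = [("Mons", "Belgium"), ("Gent", "Belgium"),
     ("Liège", "Belgium"), ("Liège", "France")]


def blocks(rows, key_len):
    """Group rows into blocks of tuples that agree on the primary key."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[:key_len], []).append(row)
    return list(grouped.values())


def repairs(rows, key_len):
    """Enumerate all repairs: pick exactly one tuple from every block."""
    return [list(choice) for choice in product(*blocks(rows, key_len))]


def q1(r, t):
    """q1 = ∃x∃y (R('VLDB', x, y) ∧ T(y, 'Belgium'))."""
    return any(conf == "VLDB" and (town, "Belgium") in t
               for (conf, _year, town) in r)


def phi1(r, t):
    """The sentence ϕ1, evaluated once on the (possibly inconsistent) database."""
    def surely_in_belgium(town):
        return ((town, "Belgium") in t and
                all(country == "Belgium" for (tn, country) in t if tn == town))
    return any(
        conf == "VLDB" and
        all(surely_in_belgium(town2)
            for (conf2, year2, town2) in r
            if conf2 == "VLDB" and year2 == year)
        for (conf, year, _town) in r)


# Brute force: evaluate q1 on every repair (exponential in general).
all_repairs = [(r, t) for r in repairs(R, 2) for t in repairs(T, 1)]
certain = all(q1(r, t) for (r, t) in all_repairs)
```

The brute-force check visits every repair, whereas `phi1` reads the original database only once, which is the point of a first-order rewriting.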

Formally, a certain first-order rewriting for a Boolean query q is a first-order sentence ϕ such that for every database db, q evaluates to true on every repair of db if and only if ϕ evaluates to true on db. The interest of first-order rewriting is evident: to know whether q is true in every repair, it suffices to execute ϕ once on the original database. Since ϕ is first-order, it can be encoded in SQL and executed in polynomial time data complexity using standard database technology.

An alternative (but equivalent) way for defining first-order rewriting makes use of the following set, where q is any Boolean query:

CERTAINTY(q) = {db | q evaluates to true on every repair of database db}.

Saying that q has a certain first-order rewriting is then equivalent to saying that the set CERTAINTY(q) is first-order expressible.

We study the decidability of first-order expressibility of CERTAINTY(q) when q is a conjunctive query without self-join. This issue is not new. The historically first syntactic class of conjunctive queries that allow certain first-order rewriting was introduced in 2005 by Fuxman and Miller under the name Cforest [10]. The class Cforest was later generalized in a number of ways. On the other hand, it is known that many conjunctive queries do not allow certain first-order rewriting. The nonexistence of a certain first-order rewriting can be settled by complexity-theoretic arguments: if CERTAINTY(q) is coNP-hard, then it is not first-order expressible; examples appear in [5, 10]. In [19, 20], Hanf-locality of first-order logic and Ehrenfeucht-Fraïssé games were used to obtain more inexpressibility results. All these works, however, prove first-order (in)expressibility of CERTAINTY(q) only under rather ad hoc syntactic restrictions on q.
The current article presents a significant breakthrough: a sound and complete decision procedure for deciding first-order expressibility of CERTAINTY(q) for all acyclic (in the sense of [4]) conjunctive queries q without self-join. Moreover, if CERTAINTY(q) is first-order expressible, then a certain first-order rewriting for q can be constructed in a rather straightforward way. The set of acyclic conjunctive queries without self-join is a large class of queries of practical interest. We briefly discuss the remaining syntactic restrictions. First, the restriction to queries without self-join is not uncommon in uncertain [10] and probabilistic databases [7]. Moreover, it is known that self-joins quickly result in first-order inexpressibility [20]. Second, the acyclicity restriction implies the existence of join trees, which are helpful in the technical development. This acyclicity condition is also implicitly inherent in the class Cforest (see the proof of Corollary 5 in [20]). Not all our results, however, require acyclicity.

The article is organized as follows. Section 2 introduces the mathematical concepts and terminology. Section 3 discusses related work. Section 4 introduces the construct of attack graph, a novel tool for studying first-order (in)expressibility. Section 5 gives a sufficient condition under which a (not-necessarily-acyclic) conjunctive query q, without self-join, has no certain first-order rewriting. Based on this, Sections 6 and 7 derive a sufficient and necessary condition for first-order expressibility of CERTAINTY(q) when q is also acyclic. Section 8 concludes the article. The appendix contains helping lemmas and certain proofs.

2. NOTATIONS AND TERMINOLOGY

We assume disjoint sets of variables and constants. Variables and constants are symbols. If ~x is a sequence of symbols, then vars(~x) is the set of variables that occur in ~x. Let U be a set of variables. A valuation over U is a total mapping θ from U to the set of constants. Such a valuation θ is often extended to be the identity on constants and on variables not in U.

Key-equal atoms. Every relation name R has a fixed signature, which is a pair [n, k] with n ≥ k ≥ 1: the integer n is the arity of the relation name and {1, 2, ..., k} is the primary key. If R is a relation name with signature [n, k], then R(s1, ..., sn) is an R-atom (or simply atom), where each si is a constant or a variable (1 ≤ i ≤ n). Such an atom is commonly written as R(~x, ~y) where ~x = s1, ..., sk and ~y = sk+1, ..., sn. An atom is ground if it contains no variables. Two ground atoms R1(~a1, ~b1), R2(~a2, ~b2) are key-equal if R1 = R2 and ~a1 = ~a2.

Database and repair. A database schema is a finite set of relation names. All constructs that follow are defined relative to a fixed database schema. A database is a finite set db of ground atoms using only the relation names of the schema. A database db is consistent if it does not contain two distinct atoms that are key-equal. A repair of a database db is a maximal (under set inclusion) consistent subset of db.

Boolean conjunctive query. A Boolean conjunctive query is a finite set q = {R1(~x1, ~y1), ..., Rn(~xn, ~yn)} of atoms.¹ This query q is satisfied by a database db, denoted db |= q, if there exists a valuation θ over vars(~x1 ~y1 ... ~xn ~yn) such that for each i ∈ {1, ..., n}, Ri(θ(~xi), θ(~yi)) ∈ db. We say that q has a self-join if some relation name occurs more than once in q. The restriction to Boolean queries simplifies the technical treatment, but is not fundamental; Section 7 explains how to deal with nonBoolean queries. Since every relation name has a fixed signature, relevant primary key constraints are implicitly present in all queries; moreover, primary keys will be underlined.

¹ Up to Section 7, all queries are understood to be Boolean and quantifiers are omitted.

Consistent query answering. Let q be a Boolean conjunctive query and db a database. We write db |=sure q if for every repair rep of db, we have rep |= q. Given a Boolean conjunctive query q, CERTAINTY(q) is (the complexity of) the following set:

CERTAINTY(q) = {db | db is a database and db |=sure q}.

CERTAINTY(q) is said to be first-order expressible if there exists a first-order sentence ϕ such that for every database db, db ∈ CERTAINTY(q) if and only if db |= ϕ. The formula ϕ, if it exists, is called a certain first-order rewriting for q.

Notational conventions. We will use letters A, B, C for ground atoms appearing in a database, and F, G, H, J for atoms appearing in a query. For F = R(~x, ~y), we denote by keyVars(F) the set of variables that occur in ~x, and by allVars(F) the set of variables that occur in F, that is, keyVars(F) = vars(~x) and allVars(F) = vars(~x) ∪ vars(~y).

Acyclic conjunctive queries. An intersection tree for a conjunctive query q is an edge-labeled undirected tree whose vertices are the atoms of q; an edge between atoms F and G is labeled by the (possibly empty) set allVars(F) ∩ allVars(G). An intersection tree for q is called a join tree for q if it satisfies the following condition:

Connectedness Condition: whenever the same variable x occurs in two atoms F and G, then x occurs in each atom on the unique path linking F and G.

The term Connectedness Condition appears in [11] and refers to the fact that the set of vertices in which x occurs induces a connected subtree. A conjunctive query q is acyclic if it has a join tree. The notions of join tree and acyclicity are standard [4]. The weaker notion of intersection tree is not in the standard literature, but is derived from the notion of intersection graph defined by Maier [16, page 453]. The symbol τ will be used for join trees, and ρ for intersection trees. We write F ⌢^L G to denote an edge between F and G with label L. A join tree is shown in Fig. 1 (left). The query {R0(y, z, u), R1(x, y), R2(z, x, u)} is cyclic and hence has no join tree. An intersection tree for that query is shown in Fig. 4 (left).
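The Connectedness Condition can be tested directly on a given intersection tree. The sketch below is illustrative only: the function names and the dictionary encoding of trees are assumptions, and the two trees are transcribed from the descriptions of Fig. 1 (left) and Fig. 4 (left), not computed from the queries.

```python
from collections import deque


def path(tree, start, goal):
    """Return the unique path between two vertices of an undirected tree.

    `tree` maps each vertex to the list of its neighbours (assumed connected).
    """
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in tree[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    result = []
    node = goal
    while node is not None:
        result.append(node)
        node = parent[node]
    return result[::-1]


def is_join_tree(all_vars, tree):
    """Check the Connectedness Condition for every pair of atoms."""
    atoms = list(all_vars)
    for i, f in enumerate(atoms):
        for g in atoms[i + 1:]:
            shared = all_vars[f] & all_vars[g]
            for h in path(tree, f, g):
                if not shared.issubset(all_vars[h]):
                    return False
    return True


# Tree of Fig. 1 (left): F = R0(x, y), G = R1(y, x), H = R2(a, x); a is a constant.
q3_vars = {"F": {"x", "y"}, "G": {"x", "y"}, "H": {"x"}}
tau3 = {"F": ["G", "H"], "G": ["F"], "H": ["F"]}

# Intersection tree for the cyclic query {R0(y, z, u), R1(x, y), R2(z, x, u)},
# with R0 adjacent to both R1 and R2 (assumed reading of Fig. 4, left).
q5_vars = {"R0": {"y", "z", "u"}, "R1": {"x", "y"}, "R2": {"z", "x", "u"}}
rho5 = {"R0": ["R1", "R2"], "R1": ["R0"], "R2": ["R0"]}
```

Here `is_join_tree(q5_vars, rho5)` fails because x occurs in R1 and R2 but not in R0, which lies on the path between them.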

3. RELATED WORK

Certain (or consistent) query rewriting goes back to [3]. Fuxman and Miller [10] were the first to focus on certain first-order rewriting of conjunctive queries under primary key constraints, with applications in the ConQuer system [9]. They defined a class of conjunctive queries without self-join, called Cforest, such that every query in that class has a certain first-order rewriting. At the same time, however, they recognized that the query q2 = ∃x∃y(R(x, y) ∧ S(x, y)) is not in Cforest and yet has a certain first-order rewriting. A larger class of conjunctive queries (including q2) that admit certain first-order rewriting was presented in [19], where it was also shown that a join of two distinct relations outside that class cannot have a certain first-order rewriting (but no such inexpressibility result was obtained for joins of three or more relations). The current article further improves the results of [19].

Like [19], the current article concentrates on conjunctive queries that are acyclic. From the proof of Corollary 5 in [20], it follows that acyclicity is also implicit in the class Cforest. That is, each Cforest query can be thought of as being composed of one or more Ctree queries, each of which is acyclic and has a certain first-order rewriting; sharing of variables among these Ctree components is highly restricted, so that a certain first-order rewriting of the global query can be obviously constructed from the rewritings of each component.

In [18, 20], we defined the semantic class Crooted of conjunctive queries (possibly cyclic and with self-joins), and we showed first-order expressibility of CERTAINTY(q) for all q ∈ Crooted. However, no membership test for Crooted is known. From the results in the current article, it follows that if an acyclic conjunctive query without self-join has a certain first-order rewriting, then it belongs to Crooted.

The problem of certain conjunctive query answering under primary keys can be extended in several ways. Grieco et al. [12] have studied certain query answering under both key and exclusion dependencies. Lembo et al. [14] have studied certain first-order rewriting of unions of conjunctive queries under key dependencies. Finally, uncertainty by primary key violations is a first step to probabilistic databases, which also assume some key constraint [2, 6, 13].

4. ATTACK GRAPH

We compute for each intersection tree a new graph, called attack graph. Attack graphs will be the tool used for deciding the existence of certain first-order rewritings.

Every atom F in a query q gives rise to a functional dependency among the variables that occur in F. For example, R(x, y, z) gives rise to {x, y} → {z}. The set K(q) defined next collects all functional dependencies that arise in atoms of q.

Definition 1. Let q be a Boolean conjunctive query. We define K(q) as the following set of functional dependencies: K(q) = {keyVars(F) → allVars(F) | F ∈ q}.

Example 1. Let q3 denote the query whose join tree τ3 is shown in Fig. 1 (left). Then, K(q3) contains {x} → {x, y}, {y} → {x, y}, and {} → {x}. The latter functional dependency arises in the atom R2(a, x) whose primary key value contains no variables.

To understand the statement of the following lemma, notice that a valuation over a finite set of variables can be regarded as a tuple by treating variables as attributes. For example, let θ be the valuation (or tuple) over {x, y, z} defined by θ = {x ↦ a, y ↦ b, z ↦ c}. Let µ = {x ↦ a, y ↦ b, z ↦ d}. Then, {θ, µ} satisfies the functional dependency x → y because θ and µ agree on y. On the other hand, {θ, µ} falsifies x → z because θ and µ agree on x but disagree on z.

Lemma 1. Let q be a Boolean conjunctive query (possibly with self-joins). Let U be the set of variables that occur in q. Let db be a consistent database. If θ, µ are valuations over U such that θ(q), µ(q) ⊆ db, then {θ, µ} |= K(q).

Recall from relational database theory [17, page 387] that if Σ is a set of functional dependencies over a set U of attributes and X ⊆ U, then the attribute closure of X (with respect to Σ) is the set {A ∈ U | Σ |= X → A}. We say that X is closed if X is equal to the closure of X.

Definition 2. Let q be a Boolean conjunctive query. Let U be the set of variables that occur in q. For every F ∈ q, we define: F^{+,q} = {x ∈ U | K(q \ {F}) |= keyVars(F) → x}.

In words, F^{+,q} is the attribute closure of the set keyVars(F) with respect to the set of functional dependencies that arise in the atoms of q \ {F}. Note that variables play the role of attributes in our framework.

We now define attack graphs. Every intersection tree has a unique attack graph. The vertices of an intersection tree and its attack graph are the same. Attack graphs, unlike intersection trees, are directed graphs. The construct of attack graph will turn out to be a powerful tool for characterizing first-order expressibility of CERTAINTY(q). In particular, we will show that the following three properties hold for all Boolean conjunctive queries q without self-join:

• If ρ is an intersection tree for q such that the attack graph of ρ has a cycle with exactly two vertices, then CERTAINTY(q) is not first-order expressible. This is Theorem 1.
• If τ is a join tree for q (meaning that q is acyclic) and the attack graph of τ has a cycle (of any length), then CERTAINTY(q) is not first-order expressible. This is Theorem 2.
• If τ is a join tree for q and the attack graph of τ is acyclic, then CERTAINTY(q) is first-order expressible. In this case, a certain first-order rewriting for q can be constructed from a topological sorting of the attack graph. This is Theorem 3.

[Figure 1, left panel: join tree τ3 on the atoms F = R0(x, y), G = R1(y, x), and H = R2(a, x), with an edge between F and G labeled {x, y} and an edge between F and H labeled {x}. Right panel: attack graph on the atoms R0(x, y), R1(y, x), R2(a, x).]

Figure 1: Join tree τ3 (left) and attack graph (right) for query q3. The attack graph is acyclic.

[Figure 2, left panel: join tree τ4 on the atoms F = R0(x, y), G = R1(y, x), H = R2(x, y), J = R3(x, z), and K = R4(x, z), with edge labels {x, y}, {x, y}, {x}, and {x, z}. Right panel: attack graph on the same atoms.]

Figure 2: Join tree τ4 (left) and attack graph (right) for query q4. The attack graph contains a cycle.

Definition 3. Let ρ be an intersection tree for Boolean conjunctive query q. The attack graph of ρ is a directed graph whose vertices are the atoms of q. There is a directed edge from F to G if F, G are distinct atoms such that for every label L on the unique path that links F and G in ρ, we have L ⊈ F^{+,q}.

We write F ⇝^ρ G if the attack graph of ρ contains a directed edge from F to G. If F ⇝^ρ G, we say that F attacks G (or that G is attacked by F). A cycle of size n in the attack graph is then a sequence of edges F0 ⇝^τ F1 ⇝^τ F2 ⇝^τ ... ⇝^τ Fn−1 ⇝^τ F0.

Example 2. Consider again join tree τ3 for query q3 in Fig. 1 (left). To shorten notation, let F = R0(x, y), G = R1(y, x), and H = R2(a, x), as indicated in Fig. 1 (left). The attack graph of τ3 is shown in Fig. 1 (right) and is computed as follows. We have:

K(q3 \ {F}) = {{y} → {x, y}, {} → {x}}
K(q3 \ {G}) = {{x} → {x, y}, {} → {x}}
K(q3 \ {H}) = {{x} → {x, y}, {y} → {x, y}}

We have keyVars(F) = {x}, which is already closed with respect to K(q3 \ {F}). Thus, F^{+,q3} = {x}. The path from F to G in the join tree is F ⌢^{{x, y}} G. Since the label {x, y} is not contained in F^{+,q3}, the attack graph contains a directed edge from F to G, i.e. F ⇝^{τ3} G. The path from F to H in the join tree is F ⌢^{{x}} H. Since the label {x} is contained in F^{+,q3}, the attack graph contains no directed edge from F to H.

We have keyVars(G) = {y} and the closure of {y} with respect to K(q3 \ {G}) is {x, y}. Thus, G^{+,q3} = {x, y}. The path from G to F in the join tree is G ⌢^{{x, y}} F. Since the label {x, y} is contained in G^{+,q3}, the attack graph contains no directed edge from G to F. For that same reason, the attack graph contains no directed edge from G to H.

Finally, we have keyVars(H) = {}, which is already closed with respect to K(q3 \ {H}). Thus, H^{+,q3} = {}. The path from H to G in the join tree is H ⌢^{{x}} F ⌢^{{x, y}} G. Since no label on that path is contained in H^{+,q3}, the attack graph contains a directed edge from H to G, i.e. H ⇝^{τ3} G. It is then obvious that the attack graph must also contain a directed edge from H to F, i.e. H ⇝^{τ3} F.

Example 3. Consider join tree τ4 for query q4 in Fig. 2 (left). Notice that R2 is full-key, i.e. R2 has arity 2 and its primary key is {1, 2}. We have K(q4 \ {F}) ≡ {y → x, x → z}, K(q4 \ {G}) ≡ {x → y, x → z}, and K(q4 \ {H}) = K(q4 \ {J}) = K(q4 \ {K}) ≡ {x → y, y → x, x → z}. Consequently,

F^{+,q4} = {x, z}
G^{+,q4} = {y}
H^{+,q4} = {x, y, z}
J^{+,q4} = {x, y, z}
K^{+,q4} = {x, y, z}


Since no edge label is contained in G^{+,q4}, the atom G attacks every other atom. The completed attack graph is shown in Fig. 2 (right). It contains a cycle F ⇝^{τ4} G ⇝^{τ4} F of size 2. Notice that F ⇝^{τ4} G and G ⇝^{τ4} J, but F ⇝̸^{τ4} J. So attack graphs need not be transitive.

Figure 3: Schematic representation of atoms in dbyes and dbno.
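Definitions 2 and 3 translate almost literally into a small program. The sketch below recomputes the attack graph of Example 2; the encoding of atoms as pairs (keyVars, allVars), the encoding of tree edges, and all function names are assumptions made for illustration.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under a list of (lhs, rhs) dependencies."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs.issubset(closed) and not rhs.issubset(closed):
                closed |= rhs
                changed = True
    return closed


def plus(atom, query):
    """F^{+,q}: closure of keyVars(F) with respect to K(q \\ {F})."""
    fds = [(key, allv) for name, (key, allv) in query.items() if name != atom]
    return closure(query[atom][0], fds)


def path_labels(edges, start, goal, seen=None):
    """Labels on the unique tree path from `start` to `goal` (None if absent)."""
    seen = (seen or set()) | {start}
    if start == goal:
        return []
    for (u, v), label in edges.items():
        for a, b in ((u, v), (v, u)):
            if a == start and b not in seen:
                rest = path_labels(edges, b, goal, seen)
                if rest is not None:
                    return [label] + rest
    return None


def attack_graph(query, edges):
    """F attacks G if no label on the path from F to G is contained in F^{+,q}."""
    attacks = set()
    for f in query:
        f_plus = plus(f, query)
        for g in query:
            if f != g and all(not label.issubset(f_plus)
                              for label in path_labels(edges, f, g)):
                attacks.add((f, g))
    return attacks


# Query q3 of Example 2: each atom is mapped to (keyVars, allVars).
q3 = {"F": ({"x"}, {"x", "y"}),   # R0(x, y), key {x}
      "G": ({"y"}, {"x", "y"}),   # R1(y, x), key {y}
      "H": (set(), {"x"})}        # R2(a, x), key position holds the constant a
tau3 = {("F", "G"): {"x", "y"}, ("F", "H"): {"x"}}
```

On this input, `attack_graph(q3, tau3)` yields the three edges derived by hand in Example 2: F attacks G, and H attacks both F and G.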

Definition 4. Let q be a Boolean conjunctive query. Let ρ be an intersection tree for q. If F, G are distinct vertices in ρ, then [F, G]_ρ denotes the set of all atoms on the unique path linking F and G (including F and G).

Lemma 2. Let q be a Boolean conjunctive query. Let ρ be an intersection tree for q. Let F, G be distinct atoms of q.
1. If F ⇝^ρ G, then keyVars(G) ⊈ F^{+,q}.
2. If F ⇝^ρ G and H ∈ [F, G]_ρ \ {F}, then F ⇝^ρ H.
3. If F ⇝^ρ G, then K(q \ [F, G]_ρ) |= keyVars(F) → F^{+,q}.

5. INEXPRESSIBILITY RESULT

This section applies to not-necessarily-acyclic Boolean conjunctive queries without self-join. We provide a necessary condition for such queries to have a certain first-order rewriting.

Theorem 1. Let q be a (not-necessarily-acyclic) Boolean conjunctive query without self-join. If q has an intersection tree ρ whose attack graph has a cycle of size 2, then CERTAINTY(q) is not first-order expressible.

Proof Sketch. Let ρ be an intersection tree for q such that the attack graph of ρ contains a cycle of size 2. We can assume two atoms F, G such that F ⇝^ρ G and G ⇝^ρ F. For each first-order sentence ψ, we can construct two databases, called dbyes and dbno, that are indistinguishable by ψ, such that dbyes |=sure q and dbno ⊭sure q. A small-sized construction is schematized in Fig. 3 (the actual size will depend on the quantifier rank of ψ). We restrict our attention to atoms with the same relation name as F or G. Conflicting, key-equal atoms are linked by double-arrowed edges. Valuations θ1, µ1, θ2, µ2, ... are such that they realize these key-equal atoms and otherwise map distinct variables to new distinct constants. The construction ensures that two atoms cannot “join” unless they were introduced by the same valuation. More formally, assume a valuation ω such that ω(q) ⊆ dbno (the case where ω(q) ⊆ dbyes is analogous); if ω(F) = θi(F), then ω(G) = θi(G) and vice versa (1 ≤ i ≤ 5). Likewise, ω(F) = µi(F) if and only if ω(G) = µi(G).

Database dbno has a repair containing both {θ1(F), θ2(G), θ3(F), θ4(G), θ5(F)} and {µ1(G), µ2(F), µ3(G), µ4(F), µ5(G)}; such a repair does not satisfy q. On the other hand, every repair of dbyes satisfies q, as every repair will contain both θi(F) and θi(G) for some 1 ≤ i ≤ 5.

Finally, the construction ensures that dbyes and dbno locally look the same. Intuitively, in Fig. 3, not-too-large neighborhoods of α look the same in both databases; likewise for β, γ, δ. It is instructive to compare the positions of β and δ in dbyes and dbno. The result then follows from Hanf-locality of first-order logic [15, Theorem 4.12].

Theorem 1 applies to both cyclic and acyclic queries. Consider the cyclic query q5 = {R0(y, z, u), R1(x, y), R2(z, x, u)}. Figure 4 shows an intersection tree for q5 whose attack graph is cyclic. It follows that CERTAINTY(q5) is not first-order expressible.

Remarkably, we will show that for acyclic queries (i.e. queries that have a join tree), the converse of Theorem 1 also holds. We will actually obtain a stronger result: if an acyclic Boolean conjunctive query q, without self-join, has no certain first-order rewriting, then every join tree for q has an attack graph with a cycle of size 2.

6. JOIN TREES WITH CYCLIC ATTACK GRAPHS

From now on, we focus on conjunctive queries that have a join tree, which are called acyclic in the literature [4]. It is important to understand that the attack graph of a join tree can be cyclic (see Fig. 2) or acyclic (see Fig. 1). In this section, we show that if the attack graph of a join tree has a cycle, then it has a cycle of size 2.

Definition 5. Let q be a Boolean conjunctive query. Let τ be a join tree for q. Let F0 ⇝^τ F1 ⇝^τ F2 ⇝^τ ... ⇝^τ Fn−1 ⇝^τ F0 be a cycle (of size n) in the attack graph of τ. This cycle is said to be shortest if the attack graph of τ contains no cycle of (strictly) smaller size.

Example 4. Consider the join tree τ6 shown in Fig. 5 (left), whose attack graph is shown at the right. The cycles G ⇝^{τ6} H ⇝^{τ6} G and F ⇝^{τ6} H ⇝^{τ6} F, both of size 2, are shortest. The cycle F ⇝^{τ6} H ⇝^{τ6} G ⇝^{τ6} F of size 3 is not shortest.

Lemma 3. Let q be a Boolean conjunctive query. Let τ be a join tree for q. Every shortest cycle in the attack graph of τ has size 2.

[Figure 4, left panel: intersection tree ρ5 on the atoms R0(y, z, u), R1(x, y), and R2(z, x, u), with an edge between R0(y, z, u) and R1(x, y) labeled {y} and an edge between R0(y, z, u) and R2(z, x, u) labeled {z, u}. Right panel: attack graph on the same atoms.]

Figure 4: Intersection tree ρ5 (left) and attack graph (right) for the cyclic query q5. The attack graph is cyclic. Notice that ρ5 is not a join tree.

[Figure 5, left panel: join tree τ6 on the atoms F = R0(x, y, z), G = R1(x, y), and H = R2(z, x), with an edge between F and G labeled {x, y} and an edge between F and H labeled {x, z}. Right panel: attack graph on the same atoms.]

Figure 5: Join tree τ6 (left) and attack graph (right) for query q6. The attack graph is cyclic.

Theorem 2. Let q be a Boolean conjunctive query without self-join. Let τ be a join tree for q. If the attack graph of τ is cyclic, then CERTAINTY(q) is not first-order expressible.

Proof. If the attack graph of τ is cyclic, it must contain a shortest cycle of some size n. By Lemma 3, n = 2. The desired result then follows by Theorem 1.
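Once an attack graph is available, the test behind Theorems 2 and 3 reduces to detecting a directed cycle. The sketch below is a plain breadth-first search, not a procedure taken from the article; the edge list for τ6 contains only the attacks named in Example 4 (the figure may show further edges), and the edge list for τ3 is the one derived in Example 2.

```python
from collections import deque


def shortest_cycle_length(vertices, edges):
    """Length of a shortest directed cycle, or None if the graph is acyclic."""
    succ = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
    best = None
    for start in vertices:
        # Breadth-first search from `start`; the first edge back to `start`
        # found at distance d closes a cycle of length d + 1.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in succ[node]:
                if nxt == start:
                    length = dist[node] + 1
                    best = length if best is None else min(best, length)
                elif nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
    return best


# Attacks named in Example 4 for join tree τ6.
attacks_tau6 = [("G", "H"), ("H", "G"), ("F", "H"), ("H", "F"), ("G", "F")]
# Attack graph of τ3 as derived in Example 2.
attacks_tau3 = [("F", "G"), ("H", "F"), ("H", "G")]

cyclic_tau6 = shortest_cycle_length(["F", "G", "H"], attacks_tau6)
cyclic_tau3 = shortest_cycle_length(["F", "G", "H"], attacks_tau3)
```

A result of `None` corresponds to an acyclic attack graph; any other value signals a cycle, and by Lemma 3 the value is 2 whenever the graph comes from a join tree.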

7. JOIN TREES WITH ACYCLIC ATTACK GRAPHS

We show that the inverse of Theorem 2 is also true. That is, if a conjunctive query q, without self-join, has a join tree τ whose attack graph is acyclic, then q has a certain first-order rewriting. Moreover, we show how such a first-order rewriting can be constructed.

We relax our assumption that all variables in a conjunctive query q are (implicitly) existentially quantified. Hereafter, the notation q(x1, x2, ..., xn) is used to indicate that variables x1, x2, ..., xn are free in q (while all other variables of q remain existentially quantified). The notion of certain first-order rewriting naturally extends to such nonBoolean queries with free variables.

Definition 6. Let q(x1, x2, ..., xn) be a conjunctive query with free variables x1, x2, ..., xn. We say that a first-order formula ϕ(x1, x2, ..., xn) is a certain first-order rewriting for q(x1, x2, ..., xn) if for every database db, for all constants a1, a2, ..., an,

db |=sure q(a1, a2, ..., an) ⇐⇒ db |= ϕ(a1, a2, ..., an).

Definition 7 introduces our “rewrite function” Rewrite(F, q) for any Boolean conjunctive query q containing atom F. The definition assumes that we already know a certain first-order rewriting ϕ(~v) for the smaller, nonBoolean query q \ {F} whose free variables are allVars(F). Just before Theorem 3, we will argue that ϕ(~v) can be obtained by recursive application of the same rewrite function. Lemma 4 states that, under some condition, our rewrite function produces a certain first-order rewriting for q.

The intuition behind the rewrite function of Definition 7 goes as follows. Let F = R(~x, y1, y2, ..., yn), an atom of q. It is easy to see that db |=sure q if database db contains an R-atom R(~a, ~c) such that:
• there exists a (unique) valuation θ over vars(~x) such that θ(~x) = ~a; and
• for each R-atom R(~a, b1, ..., bn) in db, we can extend θ to a valuation θ+ over allVars(F) such that θ+(yi) = bi for 1 ≤ i ≤ n and such that db |=sure θ+(q′), where q′ = q \ {F}. We assume that R does not occur in q′ (no self-join).

Concerning the latter item, to determine whether valuation θ can be extended from vars(~x) to allVars(F), we can fix θ+(y1), θ+(y2), θ+(y3), ... in that order. Assume that the value of θ+ has already been fixed for y1, y2, ..., yi−1. Two cases are possible for yi:
1. If yi is a variable that does not occur in ⟨~x, y1, ..., yi−1⟩, then no restriction applies: we can simply let θ+(yi) be identical to bi. Correspondingly, in the first item of Definition 7, we will let yi and zi be identical.
2. Otherwise θ+(yi) is already fixed, and θ+ does not exist unless bi turns out to be equal to this already fixed value (correspondingly, in the second item of Definition 7, we will require zi = yi). Indeed, if yi is a variable that occurs in ~x, then θ+(yi) is determined by the equality θ(~x) = ~a. If yi ∈ {y1, ..., yi−1}, then θ+(yi) has already been determined by our initial assumption. Finally, if yi is a constant, then θ+(yi) is determined by θ+(yi) = yi, because every valuation is the identity on constants.

Definition 7. Let q be a Boolean conjunctive query without self-join. Let R(~x, ~y) be an atom of q, and let ~y = ⟨y1, y2, ..., yn⟩. Notice that ~y can contain constants and repeated variables. Let ~z = ⟨z1, z2, ..., zn⟩ be a sequence of distinct variables and let C be a conjunction of equalities constructed as follows for 1 ≤ i ≤ n:
1. if yi is a variable that does not occur in the sequence ⟨~x, y1, y2, ..., yi−1⟩, then zi is identical to yi;
2. otherwise zi is a new variable and C contains zi = yi.

Let ~v be a sequence of variables that contains exactly once each variable that occurs in R(~x, ~y). Let ϕ(~v) be a certain first-order rewriting for q′(~v), where q′(~v) is the nonBoolean conjunctive query whose set of atoms is q \ {R(~x, ~y)} (and whose free variables are ~v). Obviously, if q′ is empty, then ϕ = true. We define:

Rewrite(R(~x, ~y), q) = ∃~v (R(~x, ~y) ∧ ∀~z (R(~x, ~z) → (C ∧ ϕ(~v))))

If q′(~v) has no certain first-order rewriting, then the value of Rewrite(R(~x, ~y), q) is undefined.
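The construction of ~z and C in Definition 7 is purely syntactic, so it can be sketched in a few lines. The code below is a hedged illustration: the function names, the string encoding of terms, the `is_variable` callback, and the naming scheme z1, z2, ... for new variables (assumed not to clash with variables of q) are all choices made here, not by the article. The sample input is the atom R(x, x, y, y, a) with primary key position 1.

```python
def build_z_and_c(key_terms, nonkey_terms, is_variable):
    """Return (~z, C) for an atom R(~x, ~y), following the two cases of Definition 7."""
    seen = list(key_terms)
    z, equalities = [], []
    for i, y in enumerate(nonkey_terms, start=1):
        if is_variable(y) and y not in seen:
            # Case 1: first occurrence of a variable, so z_i is y_i itself.
            z.append(y)
        else:
            # Case 2: constant or repeated variable, so use a new variable
            # and record the equality z_i = y_i in C.
            fresh = f"z{i}"
            z.append(fresh)
            equalities.append((fresh, y))
        seen.append(y)
    return z, equalities


def rewrite(rel, key_terms, nonkey_terms, is_variable, phi):
    """Render Rewrite(R(~x, ~y), q) as a string, given a rewriting phi of q'."""
    z, c = build_z_and_c(key_terms, nonkey_terms, is_variable)
    terms = list(key_terms) + list(nonkey_terms)
    v = []
    for t in terms:
        if is_variable(t) and t not in v:
            v.append(t)
    exists = "".join(f"∃{x}" for x in v)
    forall = "".join(f"∀{x}" for x in z)
    atom = f"{rel}({', '.join(terms)})"
    guard = f"{rel}({', '.join(list(key_terms) + z)})"
    body = " ∧ ".join([f"{left} = {right}" for left, right in c] + [phi])
    return f"{exists}({atom} ∧ {forall}({guard} → ({body})))"


variables = {"x", "y"}
z_seq, c_eqs = build_z_and_c(["x"], ["x", "y", "y", "a"], lambda t: t in variables)
formula = rewrite("R", ["x"], ["x", "y", "y", "a"], lambda t: t in variables, "ϕ(x, y)")
```

The second position reuses y because it is the first occurrence of that variable, while the other three non-key positions receive new variables constrained by equalities in C.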

Example 5. Let q be a Boolean conjunctive query without self-join. Let F = R(x, x, y, y, a) be an atom of q, where a is a constant. The non-primary-key values in F are ⟨x, y, y, a⟩. The sequence ~z of Definition 7 becomes ⟨z1, y, z3, z4⟩ and C equals z1 = x ∧ z3 = y ∧ z4 = a. Notice that there is no variable z2; the variable y is “reused” instead. Let ϕ(x, y) be a certain first-order rewriting for q′(x, y), where q′(x, y) is the nonBoolean conjunctive query whose set of atoms is q \ {F}. Then,

Rewrite(F, q) = ∃x∃y (R(x, x, y, y, a) ∧ ∀z1∀y∀z3∀z4 (R(x, z1, y, z3, z4) → (z1 = x ∧ z3 = y ∧ z4 = a ∧ ϕ(x, y))))

Our rewrite function Rewrite(F, q) starts with an existential quantification over the variables in keyVars(F), whose model-theoretic interpretation leads to the following definition.

Definition 8. Let q be a Boolean conjunctive query. An atom F of q is said to be reifiable if for every database db, db |=sure q implies db |=sure θ(q) for some valuation θ over keyVars(F).

Lemma 4. Let $q$ be a Boolean conjunctive query without self-join. Let $F$ be a reifiable atom of $q$. If $\mathit{Rewrite}(F, q)$ is defined, then it is a certain first-order rewriting for $q$.

Since Lemma 4 only applies to queries $q$ that contain a reifiable atom $F$, it is important to recognize reifiable atoms. The construct of attack graph is helpful here: Corollary 1 states that an atom $F$ of $q$ is reifiable if $F$ is not attacked in the attack graph of $\tau$, where $\tau$ is any join tree for $q$. We first show that if an atom $F$ is not attacked, then for each database $db$, for all repairs $r$ and $s$ of $db$, there exists a repair $rep$ of $r \cup s$ (and hence of $db$) such that for every valuation $\theta$ over $\mathit{keyVars}(F)$, $rep \models \theta(q)$ implies $r \models \theta(q)$ and $s \models \theta(q)$. The construction of such a repair $rep$ is specified next.

Definition 9. Let $q$ be a Boolean conjunctive query without self-join. Let $U$ be the set of variables that occur in $q$. Let $rep$ be a repair of some database. For an atom $F$ of $q$, we define:

$$\mathit{Reify}(q, F, rep) = \{\theta \mid \theta \text{ valuation over } \mathit{keyVars}(F) \text{ and } rep \models \theta(q)\}.$$

We say that an atom $A \in rep$ is relevant for $q$ in $rep$ if for some valuation $\theta$ over $U$, $A \in \theta(q) \subseteq rep$. Let $r, s$ be two repairs of the same database $db$. A uniformisation of $[r, s]$ is a maximal sequence:

$$[r, s] = [r_0, s_0], [r_1, s_1], \ldots, [r_n, s_n]$$

where for each $i \in \{0, 1, \ldots, n-1\}$, there exist atoms $A \in r_i$ and $B \in s_i$ such that $A$ and $B$ are key-equal and $A \neq B$, and one of the following conditions is true:

• $A$ is relevant for $q$ in $r_i$ and $r_{i+1} = (r_i \setminus \{A\}) \cup \{B\}$ and $s_{i+1} = s_i$; or

• $B$ is relevant for $q$ in $s_i$ and $r_{i+1} = r_i$ and $s_{i+1} = (s_i \setminus \{B\}) \cup \{A\}$.

That is, in a uniformisation we repeatedly replace a relevant atom in either repair with its key-equal (but distinct) atom in the other repair.

Example 6. Let $q_2 = \{R(x, y), S(x, y)\}$ and

$$\begin{aligned} r_0 &= \{R(a, b), S(a, b), R(c, 1), S(c, 2)\}, \\ s_0 &= \{R(a, 1), S(a, 2), R(c, d), S(c, d)\}. \end{aligned}$$

Since $R(a, b)$ is relevant for $q_2$ in $r_0$, it is replaced with $R(a, 1)$, giving:

$$\begin{aligned} r_1 &= \{R(a, 1), S(a, b), R(c, 1), S(c, 2)\}, \\ s_1 &= \{R(a, 1), S(a, 2), R(c, d), S(c, d)\}. \end{aligned}$$

Since $R(c, d)$ is relevant for $q_2$ in $s_1$, it is replaced with $R(c, 1)$, giving:

$$\begin{aligned} r_2 &= \{R(a, 1), S(a, b), R(c, 1), S(c, 2)\}, \\ s_2 &= \{R(a, 1), S(a, 2), R(c, 1), S(c, d)\}. \end{aligned}$$

The uniformisation terminates, because there are no more relevant atoms.

It is easy to see that every uniformisation must eventually terminate.

Lemma 5. Let $q$ be a Boolean conjunctive query without self-join. Let $\tau$ be a join tree for $q$. Let $F \in q$ such that for every $G \in q \setminus \{F\}$, $G \overset{\tau}{\not\rightsquigarrow} F$. Let $r, s$ be two repairs of the same database $db$. Let $[r_n, s_n]$ be the last element in a uniformisation of $[r, s]$. Then,

1. $r_n$ is a repair of $db$; and

2. $\mathit{Reify}(q, F, r_n) \subseteq \mathit{Reify}(q, F, r) \cap \mathit{Reify}(q, F, s)$.

Corollary 1. Let $q$ be a Boolean conjunctive query without self-join. Let $\tau$ be a join tree for $q$. Let $F \in q$ such that for every $G \in q \setminus \{F\}$, $G \overset{\tau}{\not\rightsquigarrow} F$. Then $F$ is reifiable.
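The uniformisation of Definition 9 can be replayed mechanically. The sketch below is only an illustration under stated assumptions: atoms are encoded as pairs `(relation, values)`, the primary key is assumed to be the first position of every relation, query terms are assumed to be variables only, and the function names are hypothetical. A maximal sequence is not unique in general; the loop simply takes the first applicable step in sorted order.

```python
def valuations(query, rep):
    """Yield each valuation theta (a dict) such that theta(query) is a subset of rep."""
    def extend(theta, atoms):
        if not atoms:
            yield theta
            return
        (rel, terms), rest = atoms[0], atoms[1:]
        for fact_rel, values in rep:
            if fact_rel != rel or len(values) != len(terms):
                continue
            candidate = dict(theta)
            # Bind each variable, or check agreement with an earlier binding.
            if all(candidate.setdefault(t, v) == v for t, v in zip(terms, values)):
                yield from extend(candidate, rest)
    yield from extend({}, list(query))


def relevant(query, rep):
    """Atoms A of rep with A in theta(query), theta(query) a subset of rep."""
    out = set()
    for theta in valuations(query, rep):
        for rel, terms in query:
            out.add((rel, tuple(theta[t] for t in terms)))
    return out


def key_of(atom):
    rel, values = atom
    return (rel, values[:1])  # assumption: the key is the first position


def uniformise(query, r, s):
    """Return one maximal sequence [r0, s0], [r1, s1], ... as a list of pairs."""
    r, s = set(r), set(s)
    steps = [(frozenset(r), frozenset(s))]
    while True:
        s_by_key = {key_of(b): b for b in s}
        move = None
        for a in sorted(r):
            b = s_by_key.get(key_of(a))
            if b is None or a == b:
                continue
            if a in relevant(query, r):
                move = ("r", a, b)
                break
            if b in relevant(query, s):
                move = ("s", a, b)
                break
        if move is None:
            return steps
        side, a, b = move
        if side == "r":
            r = (r - {a}) | {b}
        else:
            s = (s - {b}) | {a}
        steps.append((frozenset(r), frozenset(s)))


Q2 = [("R", ("x", "y")), ("S", ("x", "y"))]
R0 = {("R", ("a", "b")), ("S", ("a", "b")), ("R", ("c", "1")), ("S", ("c", "2"))}
S0 = {("R", ("a", "1")), ("S", ("a", "2")), ("R", ("c", "d")), ("S", ("c", "d"))}
STEPS = uniformise(Q2, R0, S0)
```

On the data of Example 6, the sequence has three elements and its last pair coincides with $[r_2, s_2]$. Each step makes the two repairs agree on one more key, which is why the loop terminates.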

The proof of Theorem 3 relies on two main arguments. First, since every directed acyclic graph contains a vertex without incoming edge, it follows that an acyclic attack graph contains a nonattacked atom, which is reifiable by Corollary 1 and thus allows the application of Lemma 4. Second, constructing certain first-order rewritings for nonBoolean conjunctive queries $q(x_1, x_2, \ldots, x_n)$ is no more difficult than for Boolean queries (always without self-join). Let $c_1, c_2, \ldots, c_n$ be $n$ new constants. Let $q'$ be the query obtained from $q$ by replacing all occurrences of $x_i$ with $c_i$ ($1 \le i \le n$). Assume we know how to construct a certain first-order rewriting $\varphi$ for the Boolean conjunctive query $q'$. A certain first-order rewriting for $q(x_1, x_2, \ldots, x_n)$ can then be obtained from $\varphi$ by replacing all occurrences of $c_i$ with $x_i$ ($1 \le i \le n$). This is correct because our rewrite function treats constants as generic.

Theorem 3. Let $q$ be a Boolean conjunctive query without self-join. Let $\tau$ be a join tree for $q$. If the attack graph of $\tau$ contains no cycle, then CERTAINTY($q$) is first-order expressible.

Finally, by combining Theorems 2 and 3, we obtain the following result.

Corollary 2. Let $q$ be a Boolean conjunctive query without self-join. Let $\tau$ be a join tree for $q$. CERTAINTY($q$) is first-order expressible if and only if the attack graph of $\tau$ is acyclic.
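The decision procedure implied by Corollary 2 can be sketched in a few lines. The sketch rests on assumptions that are not restated in this section: an atom $F$ is taken to attack $G$ when no label on the path between them in the join tree is contained in $F^{+,q}$, where $F^{+,q}$ is the closure of $\mathit{keyVars}(F)$ under $K(q \setminus \{F\})$, and every edge label is taken to be the set of variables shared by its two endpoints. The data encoding and all function names are hypothetical.

```python
from itertools import permutations


def closure(atoms, f):
    """F^{+,q}: closure of keyVars(F) under the dependencies of q minus {F}."""
    s = set(atoms[f][0])
    changed = True
    while changed:
        changed = False
        for h, (key, all_vars) in atoms.items():
            if h != f and key <= s and not all_vars <= s:
                s |= all_vars
                changed = True
    return s


def path_labels(atoms, edges, f, g):
    """Labels on the unique path from f to g in the join tree."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    stack = [(f, None, [])]
    while stack:
        node, parent, labels = stack.pop()
        if node == g:
            return labels
        for nxt in adj.get(node, []):
            if nxt != parent:
                # Assumption: an edge label is the set of shared variables.
                stack.append((nxt, node, labels + [atoms[node][1] & atoms[nxt][1]]))
    raise ValueError("join tree is not connected")


def attack_graph(atoms, edges):
    graph = {f: set() for f in atoms}
    for f, g in permutations(atoms, 2):
        plus = closure(atoms, f)
        if all(not label <= plus for label in path_labels(atoms, edges, f, g)):
            graph[f].add(g)
    return graph


def is_acyclic(graph):
    """Kahn's algorithm: the graph is acyclic iff every vertex can be removed."""
    indegree = {v: 0 for v in graph}
    for v in graph:
        for w in graph[v]:
            indegree[w] += 1
    queue = [v for v in graph if indegree[v] == 0]
    removed = 0
    while queue:
        v = queue.pop()
        removed += 1
        for w in graph[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                queue.append(w)
    return removed == len(graph)


# Each atom maps to (keyVars, allVars).
# q2 = {R(x, y), S(x, y)}, both with key x.
Q2 = {"R": ({"x"}, {"x", "y"}), "S": ({"x"}, {"x", "y"})}
# {R(x, y), S(y, x)}: R has key x, S has key y.
Q_CYCLE = {"R": ({"x"}, {"x", "y"}), "S": ({"y"}, {"x", "y"})}
TREE = [("R", "S")]
```

Under these assumptions, `is_acyclic(attack_graph(Q2, TREE))` holds, whereas for `Q_CYCLE` each atom attacks the other, so the attack graph contains a cycle.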

8. CONCLUDING REMARKS

If τ is a join tree for a conjunctive query q without self-join, then acyclicity of the attack graph of τ is sufficient and necessary for the first-order expressibility of CERTAINTY(q). Consequently, we can decide whether an acyclic (in the sense of [4]) conjunctive query q, without self-join, has a certain first-order rewriting. Moreover, such a certain first-order rewriting, if it exists, can be constructed in a rather straightforward way. An issue for future research concerns relaxing the acyclicity and no-self-join assumptions. Although Theorem 1 applies to cyclic queries, it only provides a necessary condition for first-order expressibility of CERTAINTY(q) when q is cyclic. Currently, not much is known about the first-order expressibility of CERTAINTY(q) when q contains self-joins.

9. REFERENCES

[1] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[2] P. Andritsos, A. Fuxman, and R. J. Miller. Clean answers over dirty databases: A probabilistic approach. In L. Liu, A. Reuter, K.-Y. Whang, and J. Zhang, editors, ICDE, page 30. IEEE Computer Society, 2006.
[3] M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In PODS, pages 68–79. ACM Press, 1999.
[4] C. Beeri, R. Fagin, D. Maier, and M. Yannakakis. On the desirability of acyclic database schemes. J. ACM, 30(3):479–513, 1983.
[5] J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Inf. Comput., 197(1-2):90–121, 2005.
[6] N. N. Dalvi, C. Ré, and D. Suciu. Probabilistic databases: diamonds in the dirt. Commun. ACM, 52(7):86–94, 2009.
[7] N. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. VLDB J., 16(4):523–544, 2007.
[8] R. Fagin, P. G. Kolaitis, R. J. Miller, and L. Popa. Data exchange: semantics and query answering. Theor. Comput. Sci., 336(1):89–124, 2005.
[9] A. Fuxman, E. Fazli, and R. J. Miller. ConQuer: Efficient management of inconsistent databases. In F. Özcan, editor, SIGMOD Conference, pages 155–166. ACM, 2005.
[10] A. Fuxman and R. J. Miller. First-order query rewriting for inconsistent databases. J. Comput. Syst. Sci., 73(4):610–635, 2007.
[11] G. Gottlob, N. Leone, and F. Scarcello. Hypertree decompositions and tractable queries. J. Comput. Syst. Sci., 64(3):579–627, 2002.
[12] L. Grieco, D. Lembo, R. Rosati, and M. Ruzzi. Consistent query answering under key and exclusion dependencies: algorithms and experiments. In O. Herzog, H.-J. Schek, N. Fuhr, A. Chowdhury, and W. Teiken, editors, CIKM, pages 792–799. ACM, 2005.
[13] J. Huang, L. Antova, C. Koch, and D. Olteanu. MayBMS: a probabilistic database management system. In U. Çetintemel, S. B. Zdonik, D. Kossmann, and N. Tatbul, editors, SIGMOD Conference, pages 1071–1074. ACM, 2009.
[14] D. Lembo, R. Rosati, and M. Ruzzi. On the first-order reducibility of unions of conjunctive queries over inconsistent databases. In T. Grust, H. Höpfner, A. Illarramendi, S. Jablonski, M. Mesiti, S. Müller, P.-L. Patranjan, K.-U. Sattler, M. Spiliopoulou, and J. Wijsen, editors, EDBT Workshops, volume 4254 of Lecture Notes in Computer Science, pages 358–374. Springer, 2006.
[15] L. Libkin. Elements of Finite Model Theory. Springer, 2004.
[16] D. Maier. The Theory of Relational Databases. Computer Science Press, 1983.
[17] J. D. Ullman. Principles of Database and Knowledge-Base Systems, Volume I. Computer Science Press, 1988.
[18] J. Wijsen. On the consistent rewriting of conjunctive queries under primary key constraints. In M. Arenas and M. I. Schwartzbach, editors, DBPL, volume 4797 of Lecture Notes in Computer Science, pages 112–126. Springer, 2007.
[19] J. Wijsen. Consistent query answering under primary keys: a characterization of tractable queries. In R. Fagin, editor, ICDT, volume 361 of ACM International Conference Proceeding Series, pages 42–52. ACM, 2009.
[20] J. Wijsen. On the consistent rewriting of conjunctive queries under primary key constraints. Inf. Syst., 34(7):578–601, 2009.

APPENDIX

A. PROOF OF LEMMA 1

Proof of Lemma 1. Let $X \to Y$ be an arbitrary functional dependency in $K(q)$. We can assume an atom $R(\vec{x}, \vec{y})$ of $q$ such that $X = \mathit{vars}(\vec{x})$ and $Y = \mathit{vars}(\vec{x}) \cup \mathit{vars}(\vec{y})$. Assume $\theta[X] = \mu[X]$. We have $\theta(\vec{x}) = \mu(\vec{x})$. From $\theta(q), \mu(q) \subseteq db$, it follows that $R(\theta(\vec{x}), \theta(\vec{y})), R(\mu(\vec{x}), \mu(\vec{y})) \in db$. Since $\theta(\vec{x}) = \mu(\vec{x})$ and since $db$ contains no two distinct key-equal atoms, we have $\theta(\vec{y}) = \mu(\vec{y})$. It follows $\theta[Y] = \mu[Y]$.

B. PROOF OF LEMMA 2

Proof of Lemma 2. (1) Proof by contraposition. Assume $\mathit{keyVars}(G) \subseteq F^{+,q}$, that is, $K(q \setminus \{F\}) \models \mathit{keyVars}(F) \to \mathit{keyVars}(G)$. Let $L$ be the last label on the path from $F$ to $G$. Since $\mathit{keyVars}(G) \to \mathit{allVars}(G)$ belongs to $K(q \setminus \{F\})$ and $L \subseteq \mathit{allVars}(G)$, it follows $K(q \setminus \{F\}) \models \mathit{keyVars}(G) \to L$. By transitivity, $K(q \setminus \{F\}) \models \mathit{keyVars}(F) \to L$. Then $L \subseteq F^{+,q}$, hence $F \overset{\rho}{\not\rightsquigarrow} G$.

(2) Straightforward.

(3) Assume $F \overset{\rho}{\rightsquigarrow} G$. The computation of the attribute closure $F^{+,q}$ by means of a standard algorithm [1, page 165] corresponds to constructing a maximal sequence

$$\begin{array}{lll} \mathit{keyVars}(F) = & S_0 & H_1 \\ & S_1 & H_2 \\ & \vdots & \vdots \\ & S_{k-1} & H_k \\ & S_k & \end{array}$$

where

1. $S_0 \subsetneq S_1 \subsetneq \cdots \subsetneq S_{k-1} \subsetneq S_k$; and

2. for every $i \in \{1, 2, \ldots, k\}$,

(a) $H_i \in q \setminus \{F\}$. Thus, $K(q \setminus \{F\})$ contains the functional dependency $\mathit{keyVars}(H_i) \to \mathit{allVars}(H_i)$.

(b) $\mathit{keyVars}(H_i) \subseteq S_{i-1}$ and $S_i = S_{i-1} \cup \mathit{allVars}(H_i)$.

Then, $S_k = F^{+,q}$. Assume $H_i \in [F, G]_{\rho}$ for some $i \in \{1, 2, \ldots, k\}$. Then $\mathit{keyVars}(H_i) \subseteq F^{+,q}$. By item (1) in the current proof, $F \overset{\rho}{\not\rightsquigarrow} H_i$. By item (2) in the current proof, $F \overset{\rho}{\rightsquigarrow} H_i$, a contradiction. We conclude by contradiction that $H_i \notin [F, G]_{\rho}$. It follows $K(q \setminus [F, G]_{\rho}) \models \mathit{keyVars}(F) \to F^{+,q}$.

C. PROOF OF LEMMA 3

We will use two helping lemmas. Lemma 6 is technical. The situation required by the premise is illustrated in Fig. 6: the join tree $\tau$ consists of two join trees, $\tau_F$ and $\tau_G$, linked by an edge with label $L$; the atom $F$ occurs in $\tau_F$, and $G$ in $\tau_G$.

Figure 6: Join tree $\tau$ in the premise of Lemma 6.

Lemma 6. Let $q$ be a Boolean conjunctive query. Let $\tau$ be a join tree for $q$. Let $F, G$ be distinct atoms of $q$. Let $e$ be an edge with label $L$ on the path between $F$ and $G$. Let $\tau_F$ and $\tau_G$ be the join trees obtained from $\tau$ by cutting the edge $e$, such that $F \in \tau_F$ and $G \in \tau_G$. Let $U_G$ be the set of variables that occur in $\tau_G$. For every $q' \subseteq q$, for every $S \subseteq U_G$, if $K(q') \models \mathit{keyVars}(F) \to S$, then $K(q') \models L \to S$.

Proof. Let $U$ be the set of variables that occur in $q$. The computation of the set $\{x \in U \mid K(q') \models \mathit{keyVars}(F) \to x\}$ by means of a standard algorithm [1, page 165] corresponds to constructing a maximal sequence (see also the proof of Lemma 2)

$$\begin{array}{lll} \mathit{keyVars}(F) = & S_0 & H_1 \\ & S_1 & H_2 \\ & \vdots & \vdots \\ & S_{k-1} & H_k \\ & S_k & \end{array}$$

where

1. $S_0 \subsetneq S_1 \subsetneq \cdots \subsetneq S_{k-1} \subsetneq S_k$; and

2. for every $i \in \{1, 2, \ldots, k\}$, we have $H_i \in q'$ such that $\mathit{keyVars}(H_i) \subseteq S_{i-1}$ and $S_i = S_{i-1} \cup \mathit{allVars}(H_i)$.

Then, for every $x \in U$, $K(q') \models \mathit{keyVars}(F) \to x$ if and only if $x \in S_k$. We show by induction on increasing $i$ that for every $0 \le i \le k$, $K(q') \models L \to S_i \cap U_G$. The induction basis, $i = 0$, holds because $\mathit{keyVars}(F) \cap U_G \subseteq L$ by the Connectedness Condition on join trees. For the induction step, $i \to i + 1$, we distinguish two cases.

1. $H_{i+1}$ belongs to $\tau_F$. By the Connectedness Condition, $\mathit{allVars}(H_{i+1}) \cap U_G \subseteq L$. Thus, $S_{i+1} \cap U_G \subseteq (S_i \cap U_G) \cup L$. Since $K(q') \models L \to S_i \cap U_G$ by the induction hypothesis, $K(q') \models L \to (S_i \cap U_G) \cup L$. Consequently, $K(q') \models L \to S_{i+1} \cap U_G$.

2. $H_{i+1}$ belongs to $\tau_G$, hence $\mathit{allVars}(H_{i+1}) \subseteq U_G$. We have $S_{i+1} \cap U_G = (S_i \cap U_G) \cup \mathit{allVars}(H_{i+1})$. Since $K(q') \models L \to S_i \cap U_G$ by the induction hypothesis, it suffices to show $K(q') \models L \to \mathit{allVars}(H_{i+1})$. Since $\mathit{keyVars}(H_{i+1}) \subseteq S_i \cap U_G$ and $K(q') \models L \to S_i \cap U_G$ by the induction hypothesis, we have $K(q') \models L \to \mathit{keyVars}(H_{i+1})$. Since $H_{i+1} \in q'$, the set $K(q')$ contains $\mathit{keyVars}(H_{i+1}) \to \mathit{allVars}(H_{i+1})$. Consequently, $K(q') \models L \to \mathit{allVars}(H_{i+1})$.

Lemma 7. Let $q$ be a Boolean conjunctive query. Let $\tau$ be a join tree for $q$. Let

$$F_0 \overset{\tau}{\rightsquigarrow} F_1 \overset{\tau}{\rightsquigarrow} F_2 \cdots \overset{\tau}{\rightsquigarrow} F_{n-1} \overset{\tau}{\rightsquigarrow} F_0$$

be a shortest cycle in the attack graph of $\tau$. Then, for all $i, j, k \in \{0, 1, \ldots, n-1\}$, where $i \oplus k$ denotes $(i + k) \bmod n$:

1. $i \neq j$ implies $F_i \neq F_j$.

2. If $i \neq j \neq i \oplus 1$, then $F_i \overset{\tau}{\not\rightsquigarrow} F_j$.

3. If $i \neq j \neq i \oplus 1$, then $F_j \notin [F_i, F_{i \oplus 1}]_{\tau}$.

4. If $i, j, k$ are distinct, then $F_j \notin [F_i, F_k]_{\tau}$.

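Proof. (1) Assume $i \neq j$ and $F_i = F_j$. Then $F_i \overset{\tau}{\rightsquigarrow} F_{j \oplus 1} \overset{\tau}{\rightsquigarrow} F_{j \oplus 2} \cdots \overset{\tau}{\rightsquigarrow} F_i$ is a cycle of smaller size, a contradiction.

(2) Assume $F_i \overset{\tau}{\rightsquigarrow} F_j$ with $i \neq j \neq i \oplus 1$. Then $F_i \overset{\tau}{\rightsquigarrow} F_j \overset{\tau}{\rightsquigarrow} F_{j \oplus 1} \overset{\tau}{\rightsquigarrow} F_{j \oplus 2} \cdots \overset{\tau}{\rightsquigarrow} F_i$ is a cycle of smaller size, a contradiction.

(3) Assume $F_j \in [F_i, F_{i \oplus 1}]_{\tau}$ with $i \neq j \neq i \oplus 1$. Since $F_i \overset{\tau}{\rightsquigarrow} F_{i \oplus 1}$, we have $F_i \overset{\tau}{\rightsquigarrow} F_j$ by item (2) in Lemma 2. By item (2) in the current proof, we obtain a contradiction.

(4) Assume $F_j \in [F_i, F_k]_{\tau}$ with $i, j, k$ all distinct. We can assume without loss of generality that the shortest cycle is of the form:

$$F_i \overset{\tau}{\rightsquigarrow} \cdots \overset{\tau}{\rightsquigarrow} F_j \cdots \overset{\tau}{\rightsquigarrow} F_k \cdots \overset{\tau}{\rightsquigarrow} F_i$$

Then $F_j$ must belong to one of $[F_k, F_{k \oplus 1}]_{\tau}$, $[F_{k \oplus 1}, F_{k \oplus 2}]_{\tau}$, ..., or $[F_{i \ominus 1}, F_i]_{\tau}$ (with $i \ominus 1$ denoting the predecessor of $i$ on the cycle), which contradicts item (3) of the current proof.

The proof of Lemma 3 is given next.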

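Proof of Lemma 3. Assume that $F_0 \overset{\tau}{\rightsquigarrow} F_1 \overset{\tau}{\rightsquigarrow} F_2 \cdots \overset{\tau}{\rightsquigarrow} F_{n-1} \overset{\tau}{\rightsquigarrow} F_0$ is a shortest cycle of size $n$ in the attack graph of $\tau$. We need to show $n = 2$. Assume, on the contrary, $n \ge 3$. By item (4) in Lemma 7, $F_0 \notin [F_1, F_2]_{\tau}$, $F_1 \notin [F_0, F_2]_{\tau}$, and $F_2 \notin [F_0, F_1]_{\tau}$.

Since $F_1 \overset{\tau}{\not\rightsquigarrow} F_0$ by item (2) in Lemma 7, we can assume an edge $e_1$ with label $L_1$ on the path (in $\tau$) linking $F_0$ and $F_1$ such that $K(q \setminus \{F_1\}) \models \mathit{keyVars}(F_1) \to L_1$. Since $F_1 \overset{\tau}{\rightsquigarrow} F_2$, $e_1$ cannot be an edge on the path linking $F_1$ and $F_2$.

Since $F_0 \overset{\tau}{\not\rightsquigarrow} F_2$, we can assume an edge $e_0$ with label $L_0$ on the path linking $F_0$ and $F_2$ such that $K(q \setminus \{F_0\}) \models \mathit{keyVars}(F_0) \to L_0$. Since $F_0 \overset{\tau}{\rightsquigarrow} F_1$, $e_0$ cannot be an edge on the path linking $F_0$ and $F_1$. Thus, for some $G \notin \{F_0, F_1, F_2\}$, the join tree $\tau$ must contain a subtree of the following form, where straight lines denote paths of some unspecified length.

[Figure: subtree of $\tau$ in which $G$ is linked by paths to $F_0$, $F_1$, and $F_2$; the edge with label $L_1$ lies on the path between $F_0$ and $G$, and the edge with label $L_0$ lies on the path between $G$ and $F_2$.]

Since $F_0 \overset{\tau}{\rightsquigarrow} F_1$, we have $K(q \setminus \{F_0, F_1\}) \models \mathit{keyVars}(F_0) \to L_0$ by item (3) of Lemma 2. By Lemma 6, $K(q \setminus \{F_0, F_1\}) \models L_1 \to L_0$. Consequently, $K(q \setminus \{F_1\}) \models L_1 \to L_0$. By transitivity, $K(q \setminus \{F_1\}) \models \mathit{keyVars}(F_1) \to L_0$. But then $F_1 \overset{\tau}{\not\rightsquigarrow} F_2$, a contradiction. We conclude by contradiction $n < 3$.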

D. PROOF OF LEMMA 4

Proof of Lemma 4. Assume $\mathit{Rewrite}(F, q)$ is defined. Let $F = R(\vec{x}, \vec{y})$. Let $\vec{z}$, $\vec{v}$, $C$, $\varphi(\vec{v})$, and $q'$ be as in Definition 7. Let $db$ be an arbitrary database. We show that $db \models_{sure} q$ implies $db \models \mathit{Rewrite}(F, q)$ (the converse direction is obvious). Assume $db \models_{sure} q$. Since $F$ is reifiable, we can assume a valuation $\theta$ over $\mathit{vars}(\vec{x})$ such that $db \models_{sure} \theta(q)$. Then, for some valuation $\mu$ over $\mathit{vars}(\vec{y}) \setminus \mathit{vars}(\vec{x})$, $R(\theta(\vec{x}), \mu \circ \theta(\vec{y})) \in db$ and for every valuation $\zeta$ over $\mathit{vars}(\vec{z})$, if $R(\theta(\vec{x}), \zeta(\vec{z})) \in db$, then

there exists a valuation $\omega$ over $\mathit{vars}(\vec{y}) \setminus \mathit{vars}(\vec{x})$ such that $\omega \circ \theta(\vec{y}) = \zeta(\vec{z})$ and $db \models_{sure} \omega \circ \theta(q')$. (1)

Notice that, since $\vec{z}$ is a sequence of distinct variables, every atom $R(\theta(\vec{x}), \vec{b})$ can be written as $R(\theta(\vec{x}), \zeta(\vec{z}))$ with $\zeta(\vec{z}) = \vec{b}$. Notice also that the foregoing reasoning relies on the fact that relation name $R$ does not occur in $q'$ (no self-join).

We show hereafter that (1) is equivalent to $db \models_{sure} \zeta \circ \theta(C \wedge q')$. Thus, for some valuation $\theta$ over $\mathit{vars}(\vec{x})$, for some valuation $\mu$ over $\mathit{vars}(\vec{y}) \setminus \mathit{vars}(\vec{x})$, $R(\theta(\vec{x}), \mu \circ \theta(\vec{y})) \in db$ and for every valuation $\zeta$ over $\mathit{vars}(\vec{z})$, if $R(\theta(\vec{x}), \zeta(\vec{z})) \in db$, then $db \models_{sure} \zeta \circ \theta(C \wedge q')$. Since $db \models_{sure} \zeta \circ \theta(C \wedge q')$ if and only if $db \models \zeta \circ \theta(C \wedge \varphi(\vec{v}))$, it follows $db \models \mathit{Rewrite}(F, q)$.

Proof that (1) implies $db \models_{sure} \zeta \circ \theta(C \wedge q')$. Assume (1). Since the truth of $C$ does not depend on $db$, it suffices to show that $\zeta \circ \theta(C)$ is always satisfied and that for every $y \in \mathit{vars}(\vec{y}) \setminus \mathit{vars}(\vec{x})$, $\omega(y) = \zeta(y)$. Assume that $C$ contains $z_i = y_i$ for some $i \ge 1$. Then, by our construction, either $y_i$ is a constant or $y_i$ is a variable that occurs in $\langle \vec{x}, y_1, y_2, \ldots, y_{i-1} \rangle$. Also by our construction, if $y_i$ is a variable that does not occur in $\vec{x}$, then $y_i \in \mathit{vars}(\vec{z})$. We need to show

$\zeta(z_i) = \zeta \circ \theta(y_i)$. (2)

Since $\omega \circ \theta(\vec{y}) = \zeta(\vec{z})$ by (1), it follows

$\omega \circ \theta(y_i) = \zeta(z_i)$. (3)

Three cases can occur:

• Case $y_i$ is a constant. Then $\omega \circ \theta(y_i) = y_i$. By (3), $\zeta(z_i) = y_i$, which implies (2).

• Case $y_i$ is a variable that occurs in $\vec{x}$. Then, $\omega \circ \theta(y_i) = \theta(y_i)$. By (3), $\zeta(z_i) = \theta(y_i)$, which implies (2).

• Case $y_i$ is a variable that occurs in $\langle y_1, y_2, \ldots, y_{i-1} \rangle$ but not in $\vec{x}$. We can assume an integer $j \in \{1, 2, \ldots, i-1\}$ such that $y_i$ and $y_j$ are identical, and $y_j$ does not occur in $\langle \vec{x}, y_1, \ldots, y_{j-1} \rangle$. By our construction, $z_j$ and $y_j$ are identical. Since $\omega \circ \theta(\vec{y}) = \zeta(\vec{z})$ by (1), it follows $\omega \circ \theta(y_j) = \zeta(z_j)$. Since $y_i$, $y_j$, and $z_j$ are identical, $\omega \circ \theta(y_i) = \zeta(y_i)$. By (3), $\zeta(z_i) = \zeta(y_i)$, which implies (2). For the reasoning hereafter, notice also that from (3) and $\omega \circ \theta(y_i) = \omega(y_i)$, it follows $\omega(y_i) = \zeta(z_i)$, hence $\omega(y_i) = \zeta(y_i)$.

We still need to show that for every $y \in \mathit{vars}(\vec{y}) \setminus \mathit{vars}(\vec{x})$, $\omega(y) = \zeta(y)$. Assume $i \ge 1$ such that $y_i$ is a variable that does not occur in $\vec{x}$. We need to show $\omega(y_i) = \zeta(y_i)$. Two cases occur: if $y_i$ occurs in $\langle y_1, y_2, \ldots, y_{i-1} \rangle$, then the desired result follows from the third case above; if $y_i$ does not occur in $\langle y_1, y_2, \ldots, y_{i-1} \rangle$, then $z_i$ and $y_i$ are identical, hence $\omega(y_i) = \zeta(y_i)$ by (3).

Proof that $db \models_{sure} \zeta \circ \theta(C \wedge q')$ implies (1). Notice that our construction ensures that $\mathit{vars}(\vec{y}) \setminus \mathit{vars}(\vec{x}) \subseteq \mathit{vars}(\vec{z})$. The desired result holds by choosing $\omega(y) = \zeta(y)$ for each $y \in \mathit{vars}(\vec{y}) \setminus \mathit{vars}(\vec{x})$.
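The equivalence established by Lemma 4 can be checked on small instances by comparing a rewriting with a brute-force evaluation over all repairs. The sketch below does this for one hand-picked query, $q = \{R(\underline{x}, y), S(\underline{y}, a)\}$ with $a$ a constant, which is an assumption of this illustration and does not appear in the text above. Applying Definition 7 first to $R(\underline{x}, y)$ and then to $S(\underline{y}, a)$ gives the formula $\exists x \exists y \big( R(x, y) \wedge \forall y ( R(x, y) \to ( S(y, a) \wedge \forall z ( S(y, z) \to z = a ) ) ) \big)$. Facts are encoded as triples `(relation, key, value)`; all function names are hypothetical.

```python
from itertools import product


def repairs(db):
    """Enumerate all repairs: pick exactly one fact per (relation, key) block."""
    blocks = {}
    for rel, key, val in db:
        blocks.setdefault((rel, key), []).append((rel, key, val))
    for choice in product(*blocks.values()):
        yield set(choice)


def holds(rep):
    """Evaluate q = {R(x, y), S(y, 'a')} on a consistent database."""
    return any(rel == "R" and ("S", val, "a") in rep for rel, key, val in rep)


def certain_brute_force(db):
    """db |=sure q, decided by checking every repair (exponential)."""
    return all(holds(rep) for rep in repairs(db))


def certain_rewriting(db):
    """Evaluate the first-order rewriting directly on the inconsistent db."""
    def phi(y):
        # S(y, a) holds, and every S-fact with key y has value a.
        s_values = [v for rel, k, v in db if rel == "S" and k == y]
        return bool(s_values) and all(v == "a" for v in s_values)

    r_keys = {k for rel, k, v in db if rel == "R"}
    # Some key x has an R-fact, and every R-fact with key x satisfies phi.
    return any(
        all(phi(v) for rel, k, v in db if rel == "R" and k == x)
        for x in r_keys
    )


DB_YES = {("R", "1", "b"), ("R", "1", "c"), ("S", "b", "a"), ("S", "c", "a")}
DB_NO = DB_YES | {("S", "c", "d")}
```

The brute-force check enumerates exponentially many repairs, while the rewriting needs only a few scans of the database, which is the practical point of first-order expressibility.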

E. PROOF OF LEMMA 5

We use the following helping lemma.

Lemma 8. Let $q$ be a Boolean conjunctive query without self-join. Let $\tau$ be a join tree for $q$. Let $F \in q$ such that for every $G \in q \setminus \{F\}$, $G \overset{\tau}{\not\rightsquigarrow} F$. Let $rep$ be a repair of some database. Let $A \in rep$ such that $A$ is relevant for $q$ in $rep$. Let $B$ be key-equal to $A$ and $rep_B = (rep \setminus \{A\}) \cup \{B\}$. Then, $\mathit{Reify}(q, F, rep_B) \subseteq \mathit{Reify}(q, F, rep)$.

Proof. The proof is obvious if $A$ has the same relation name as $F$. Assume next that relation names in $A$ and $F$ are different. Let $\zeta$ be a valuation over $\mathit{keyVars}(F)$ such that $rep_B \models \zeta(q)$. Let $U$ be the set of variables that occur in $q$. We can assume a valuation $\zeta^+$ over $U$ such that $\zeta^+[\mathit{keyVars}(F)] = \zeta[\mathit{keyVars}(F)]$ and $\zeta^+(q) \subseteq rep_B$. We need to show $rep \models \zeta(q)$, which is obvious if $B \notin \zeta^+(q)$. Assume next $B \in \zeta^+(q)$. Since $A$ is relevant for $q$ in $rep$, we can assume a valuation $\mu$ such that $A \in \mu(q) \subseteq rep$. We can assume $G \in q$ such that $A, B$ have the same relation name as $G$. Let $q' = q \setminus \{G\}$. Let $rep' = rep_B \setminus \{B\} = rep \setminus \{A\}$. Since $q'$ contains no atom with the same relation name as $G$ (no self-join), $\zeta^+(q') \subseteq rep'$ and $\mu(q') \subseteq rep'$. Moreover, $\zeta^+[\mathit{keyVars}(G)] = \mu[\mathit{keyVars}(G)]$, because $A$ and $B$ are key-equal. Since $G \overset{\tau}{\not\rightsquigarrow} F$, we can assume an edge $e$ with label $L$ on the unique path between $G$ and $F$ such that $L \subseteq G^{+,q}$. Since $K(q \setminus \{G\}) = K(q')$, $K(q') \models \mathit{keyVars}(G) \to L$. By Lemma 1, $\zeta^+[L] = \mu[L]$. Let $\tau_G, \tau_F$ be the two join trees obtained from $\tau$ by cutting the edge $e$ with label $L$, where $\tau_G$ contains $G$, and $\tau_F$ contains $F$. This corresponds to the situation depicted in Fig. 6. Let $\kappa$ be the valuation such that for every $x$,

$$\kappa(x) = \begin{cases} \mu(x) & \text{if } x \text{ occurs in } \tau_G \\ \zeta^+(x) & \text{if } x \text{ occurs in } \tau_F \end{cases}$$

Notice that if $x$ occurs in both $\tau_G$ and $\tau_F$, then, by the Connectedness Condition, $x$ occurs in $L$, hence $\mu(x) = \zeta^+(x)$. Obviously, $\kappa(q) \subseteq rep$. It follows $rep \models \zeta(q)$.

The proof of Lemma 5 is given next.

Proof of Lemma 5. The proof of the first item is straightforward. We next prove the second item. From Lemma 8 and by using induction on the length of the uniformisation, we obtain

$$\mathit{Reify}(q, F, r_n) \subseteq \mathit{Reify}(q, F, r) \quad \text{and} \quad \mathit{Reify}(q, F, s_n) \subseteq \mathit{Reify}(q, F, s).$$

Assume $A, B$ are key-equal and $A \neq B$, $A \in r_n$ and $B \in s_n$. Then, $A$ is not relevant for $q$ in $r_n$, and $B$ is not relevant for $q$ in $s_n$. It follows that $\mathit{Reify}(q, F, r_n) = \mathit{Reify}(q, F, s_n)$. Consequently, $\mathit{Reify}(q, F, r_n) \subseteq \mathit{Reify}(q, F, s)$.

F. PROOF OF COROLLARY 1

Proof of Corollary 1. Let $db$ be an arbitrary database. By Lemma 5, there exists a repair $rep$ of $db$ such that $\mathit{Reify}(q, F, rep) = \bigcap \{\mathit{Reify}(q, F, rep') \mid rep' \text{ repair of } db\}$; the latter intersection is finite, because the number of repairs is finite. We distinguish two cases:

• $\mathit{Reify}(q, F, rep) \neq \{\}$. Then, we can assume a valuation $\theta$ over $\mathit{keyVars}(F)$ such that for each repair $rep'$ of $db$, $rep' \models \theta(q)$. It follows $db \models_{sure} \theta(q)$.

• $\mathit{Reify}(q, F, rep) = \{\}$. Then, for every valuation $\theta$ over $\mathit{keyVars}(F)$, $rep \not\models \theta(q)$. It follows $rep \not\models q$. Since $rep$ is a repair of $db$, $db \not\models_{sure} q$.

Since $db$ is arbitrary, it follows that $F$ is reifiable in $q$.

G. PROOF OF THEOREM 3

We make use of the following helping lemmas, which have easy proofs.

Lemma 9. Let $q$ be a Boolean conjunctive query. Let $\tau$ be a join tree for $q$. Let $x$ be a variable that occurs in $q$. Let $q'$ be the query obtained from $q$ by replacing each occurrence of $x$ with some constant $c$. Let $\tau'$ be the graph obtained from $\tau$ by replacing, in all vertices, each occurrence of $x$ with $c$ (and by deleting $x$ from each edge label). Then,

1. $\tau'$ is a join tree for $q'$; and

2. if the attack graph of $\tau$ is acyclic, then the attack graph of $\tau'$ is acyclic.

Lemma 10. Let $q$ be a Boolean conjunctive query. Let $\tau$ be a join tree for $q$. Let $F$ be a ground atom that belongs to $q$. If the attack graph of $\tau$ is acyclic, then $q \setminus \{F\}$ has a join tree whose attack graph is acyclic.

The proof of Theorem 3 is given next.

Proof of Theorem 3. The proof runs by induction on the length of $q$. The result is obvious for $q = \{\}$. For the induction step, assume $q \neq \{\}$. Assume the attack graph of $\tau$ contains no cycle. Let $R_1(\vec{x}_1, \vec{y}_1), R_2(\vec{x}_2, \vec{y}_2), \ldots, R_n(\vec{x}_n, \vec{y}_n)$ be a topological sorting of the atoms of $q$ with respect to the attack graph. $R_1(\vec{x}_1, \vec{y}_1)$ is reifiable by Corollary 1. Let $\vec{v} = \langle v_1, v_2, \ldots, v_n \rangle$ be a sequence of distinct variables containing each variable in $\mathit{vars}(\vec{x}_1) \cup \mathit{vars}(\vec{y}_1)$. Let $q'(\vec{v})$ be the nonBoolean conjunctive query whose set of atoms is $q \setminus \{R_1(\vec{x}_1, \vec{y}_1)\}$. Let $\vec{c} = \langle c_1, c_2, \ldots, c_n \rangle$ be a sequence of $n$ new constants. Let $q'_{\vec{v} \mapsto \vec{c}}$ be the query obtained from $q'$ by replacing each $v_i$ with $c_i$. It follows from Lemmas 9 and 10 that $q'_{\vec{v} \mapsto \vec{c}}$ has a join tree whose attack graph contains no cycle. By the induction hypothesis, $q'_{\vec{v} \mapsto \vec{c}}$ has a certain first-order rewriting $\tilde{\varphi}$. Let $\varphi(\vec{v})$ be the formula obtained from $\tilde{\varphi}$ by replacing all occurrences of $c_i$ with $v_i$ ($1 \le i \le n$). Since $\varphi(\vec{v})$ is a certain first-order rewriting for $q'(\vec{v})$, the desired result follows by Lemma 4.
