Classifying Integrity Checking Methods with regard to Inconsistency Tolerance
Hendrik Decker
Instituto Tecnológico de Informática, Valencia, Spain
Davide Martinenghi
Politecnico di Milano, Italy
Abstract
We define and examine six classes of methods for integrity checking: case-based, compositional, relevance-based, simplification-based, total-integrity-dependent, and measure-based ones. Each, except the penultimate, corresponds to a particular form of inconsistency tolerance. Inconsistency measures provide a new approach to integrity checking and inconsistency tolerance. For many methods, our classification makes proofs or disproofs of their inconsistency tolerance easier and more transparent. In general, a better understanding of inconsistency-tolerant integrity checking is achieved.
Categories and Subject Descriptors H.2.1 [Database Management]: Logical Design
General Terms Reliability, Theory
Keywords inconsistency tolerance, integrity checking, integrity constraints
1. Introduction
Roughly, a computational method is inconsistency-tolerant if it works well also in the presence of inconsistent data. In this sense, many computations that produce valuable results are inconsistency-tolerant, since data often are not in a perfectly consistent state. However, inconsistency tolerance becomes a critical property for methods based on formal logic, since propositional and predicate calculus are known to be intolerant of the least bit of inconsistency: the ex contradictione quodlibet principle justifies the dismissal of any output inferred from inconsistency. In particular, inconsistency tolerance is not gratuitous for integrity checking, i.e., for methods that evaluate whether the integrity constraints of a database will be satisfied after executing an update (22; 21). In (11), it is shown for several such methods whether they are inconsistency-tolerant or not. That is, in the presence of inconsistency, the output of the former can always be trusted, while the output of the latter cannot.
From now on, let ‘method’ always mean an integrity checking method, unless specified otherwise, and let the symbol $\mathcal{M}$ always stand for a method. Due to the potential complexity of constraints, integrity checking can be prohibitively costly. To reduce costs, methods simplify constraints (7). To justify the correctness of their simplifications, all methods in the literature impose the total integrity principle, which requires that all integrity constraints be fully satisfied before the update. Thus, a method can be called ‘inconsistency-tolerant’ if it remains correct even if this principle is infringed in practice. For instance, a primary key constraint on some database relation $r$ may be violated by two tuples with the same key. That violation may be due, e.g., to a merging of heterogeneous data sources, for which integrity checking had been switched off (a common practice in data warehousing), or to other irregularities. The violation infringes the total integrity principle, but does not prevent integrity checking from accepting the entry of any new tuple in $r$ if and only if its primary key is unique.
Rather than verifying or falsifying inconsistency tolerance for each individual method anew, it is desirable to have more general criteria by which entire classes of methods can be qualified in terms of inconsistency tolerance. To identify such classes is the main purpose of this paper.
Section 2 refers to previous work. Section 3 surveys the rest of the paper. Section 4 provides preliminary notions. Section 5 revisits the case-based concept of inconsistency-tolerant integrity checking (11). Sections 6–10 introduce the classes of compositional, relevance-based, simplification-based, total-integrity-dependent and measure-based methods. Each is discussed w.r.t. its capacity of inconsistency tolerance. Section 11 concludes with a broader perspective and an outlook to future work.
2. Previous work
As far as the authors know, no study of inconsistency-tolerant integrity checking exists yet, apart from related work by Decker and Martinenghi (11). Intuitively, a method is inconsistency-tolerant if it sanctions updates that preserve integrity, i.e., everything consistent before the update remains consistent afterwards, even if there are integrity violations before the update. If the update would cause an integrity violation that does not yet exist, a warning is raised or the update is rejected. The case-based definition of inconsistency tolerance in (11), as revisited in section 5, was the first one in the literature by which methods have been classified as inconsistency-tolerant. A straightforward application of that definition to concrete methods can be complicated by the methods’ technical details from which the definition abstracts away.
Partially supported by FEDER and the Spanish MEC grant TIN2006-14738-C02-01.
Supported by the Italian PRIN project New Technologies and Tools for the Integration of Web Search Services.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PPDP’08, July 15–17, 2008, Valencia, Spain. Copyright © 2008 ACM 978-1-60558-117-0/08/07...$5.00
[...] $D$ and $D^U$, i.e., by a bipartite finite set of database clauses to be inserted or, resp., deleted. An integrity constraint (in short, constraint) is a first-order predicate logic sentence in some underlying language, represented either as a denial (i.e., a clause without head whose body is a conjunction of literals) or in prenex normal form (i.e., quantifiers outermost, negations innermost). An integrity theory is a finite set of constraints.
From now on, let the symbols $D$, $IC$, $U$ and $I$ always denote a database, an integrity theory, an update and, resp., a constraint. By a slight abuse of functional notation, let $D(IC) = sat$ and $D(I) = sat$ denote that $IC$ or, resp., $I$ is satisfied in $D$, and $D(IC) = vio$ ($D(I) = vio$) that it is violated. ‘Consistency’ and ‘inconsistency’ are synonymous with ‘satisfied’ and, resp., ‘violated’ integrity.
We assume that the semantics of integrity is two-valued, compositional, symmetric and categorical, as defined by a mapping from some class of tuples $(D, IC)$ to a set of integrity values. Two-valued means that the only integrity values are $sat$ and $vio$. Compositionality means that, for each $D$ and each $IC$, $D(IC) = sat$ if and only if $D(I) = sat$ for each $I \in IC$, and $D(IC) = vio$ if and only if $D(I) = vio$ for some $I \in IC$. [...]
For each triple $(D, IC, U)$, the value of $\mathcal{M}_{bf}(D, IC, U)$ is computed by plainly evaluating $D^U(I)$ for each $I \in IC$ [...] $(D, IC, U) = ok$.
When computing $\mathcal{M}(D, IC, U)$ compositionally, $\mathcal{M}$ may of course stop and output $ko$ as soon as $\mathcal{M}(D, \{I\}, U) = ko$, for some $I \in IC$. Although clever sequencing may obviously result in better performance, the particular sequence by which a compositional computation would evaluate the constraints of an integrity theory is immaterial in this paper. In fact, many known methods are compositional. The following theorem, which identifies compositionality as a necessary condition for inconsistency tolerance, can be shown by applying the definitions.
b) If $I$ is of the form $\leftarrow B$ or $Q\,W$ (where $B$ is a conjunction of literals, $Q$ is a vector of $\forall$-quantifiers of all global variables in $I$ and $W$ is a formula in prenex normal form), and if $\theta$ is a substitution of the global variables in $I$, then $\leftarrow B\theta$ or, resp., $\forall(W\theta)$ is called a case of $I$, where $\forall(\cdot)$ denotes universal closure. If $\theta$ is a ground substitution, then $\leftarrow B\theta$ or, resp., $W\theta$ is called a basic case.
c) Let $SatCas(D, IC)$, resp., $VioCas(D, IC)$ denote the set of all cases $C$ of constraints in $IC$ such that $D(C) = sat$, or, resp., $D(C) = vio$.
d) $\mathcal{M}$ is called case-based if, for each triple $(D, IC, U)$ and each case $C \in SatCas(D, IC)$, the following holds.
$\mathcal{M}(D, IC, U) = ok \Rightarrow D^U(C) = sat$ (2)
EXAMPLE 1. $I = \leftarrow p(x, y), p(x, z), y \neq z$ is a primary key constraint on the first column of relation $p$. The foreign key constraint $I' = \forall x\, \forall y\, \exists z\, (q(x, y) \rightarrow p(y, z))$ on the second column of relation $q$ references the primary key of $p$. All variables of $I$ are global. The global variables of $I'$ are $x$ and $y$. For $U = insert\ q(a, b)$, most methods $\mathcal{M}$ only evaluate the basic case $\exists z\, (q(a, b) \rightarrow p(b, z))$ of $I'$, or rather its simplification $\exists z\, p(b, z)$, obtained since $q(a, b)$ is made true by $U$. If, for instance, $(b, b)$ and $(b, c)$ are rows in $p$, $\mathcal{M}$ outputs $ok$, ignoring all irrelevant violated cases such as, e.g., $\leftarrow p(b, b), p(b, c), b \neq c$ and $I$, i.e., all extant violations of the primary key constraint. $\mathcal{M}$ is case-based if it always ignores irrelevant violations. If there is no tuple matching $(b, z)$ in $p$, then $\mathcal{M}$ outputs $ko$, indicating that $U$ causes a violation of $I'$.
It is easy to see that each case-based method is correct, in the sense of definition 1, i.e., case-based inconsistency-tolerant integrity checking generalizes the traditional approach which insists on total integrity.
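Example 1 can be traced in a few lines of code. The following is only a minimal sketch under stated assumptions: relations are modelled as Python sets of pairs, and the helper names `violated_key_cases` and `check_insert_q` are hypothetical and not taken from any of the cited methods. It shows that evaluating the simplified basic case $\exists z\, p(b, z)$ yields $ok$ although the primary key constraint is already violated.

```python
from typing import Dict, Set, Tuple

Row = Tuple[str, str]
Database = Dict[str, Set[Row]]


def violated_key_cases(db: Database) -> Set[Tuple[str, str, str]]:
    """Violated basic cases  <- p(x, y), p(x, z), y != z  of the key constraint I.

    Each unordered pair of conflicting rows is reported once (y < z).
    """
    rows = db["p"]
    return {(x, y, z) for (x, y) in rows for (x2, z) in rows if x == x2 and y < z}


def check_insert_q(db: Database, new_row: Row) -> str:
    """Evaluate only the simplified basic case  exists z: p(b, z)  for insert q(a, b).

    Extant violations of the key constraint on p are never looked at.
    """
    _, b = new_row
    return "ok" if any(key == b for (key, _) in db["p"]) else "ko"


# Database of example 1: (b, b) and (b, c) are rows in p, so I is violated.
db: Database = {"p": {("b", "b"), ("b", "c")}, "q": set()}

extant = violated_key_cases(db)               # the irrelevant, ignored violation
verdict_ab = check_insert_q(db, ("a", "b"))   # some tuple matches (b, z) in p
verdict_ad = check_insert_q(db, ("a", "d"))   # no tuple matches (d, z) in p
```

The first verdict accepts the update in spite of the extant key violation, the second rejects an update that would newly violate the foreign key constraint.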
The gains are indeed formidable: even in the presence of (obvious or hidden, known or unknown) violations of integrity (which, in practice, occur frequently), transaction processing may continue, while the integrity of all cases that satisfy the constraints is maintained. Surprisingly, not just $\mathcal{M}_{bf}$, but in fact most, though not all, known methods for integrity checking are case-based. For a representative selection of methods, as described in (22; 20; 25; 7; 15), their inconsistency tolerance or lack thereof is assessed in (11). The respective proofs enter into technical details of each particular method. As already indicated, one of the purposes of this paper is to provide conditions by which the inconsistency tolerance of whole classes of methods can be determined without having to go into the particular details of each class member.
THEOREM 1. If $\mathcal{M}$ is case-based, then $\mathcal{M}$ is compositional.
The converse of theorem 1 does not hold. Example 2 shows that the method described in (15) is not inconsistency-tolerant. To see this, note that, for each $\mathcal{M}$ and each triple $(D, IC, U)$ such that $IC$ is unitary, as in example 2, $\mathcal{M}(D, IC, U) = \mathcal{M}^{c}(D, IC, U)$.
EXAMPLE 2. The constraint $I = \leftarrow p(x, y), q(z),$ [...] reason that there is no value $z$ of $q$ in the interval [1, 5], and hence, that $U$ cannot cause any violation of $I$, because [2, 4] is contained in [1, 5]. Hence, both the method of (15) and its compositional version output $ok$ for $(D, \{I\}, U)$. While this output would be correct under the premise of total integrity in $D$, it wrongly indicates that $U$ would not violate any case of $I$ that is satisfied in $D$. Thus, neither the method of (15) nor its compositional version is case-based.
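The failure described in example 2 can be made concrete with a small sketch. Everything specific in it is an assumption made for illustration only, since the database and the update of the example are not fully recoverable here: we assume the denial forbids a $q$-value inside a $p$-interval, $D = \{p(1, 5), q(3)\}$ and $U = insert\ p(2, 4)$. The function `containment_shortcut` is a hypothetical stand-in for the containment reasoning described above, not the actual algorithm of (15).

```python
from typing import Set, Tuple

Interval = Tuple[int, int]


def violated_cases(p_rows: Set[Interval], q_vals: Set[int]) -> Set[Tuple[int, int, int]]:
    """Violated ground cases (x, y, z) of the assumed denial  <- p(x, y), q(z), x <= z <= y."""
    return {(x, y, z) for (x, y) in p_rows for z in q_vals if x <= z <= y}


def containment_shortcut(p_rows: Set[Interval], q_vals: Set[int], new: Interval) -> str:
    """Accept the insertion of p(new) if new lies within an existing p-interval.

    This is justified only under total integrity: if no q-value lies in the
    enclosing interval, none can lie in the enclosed one.
    """
    lo, hi = new
    if any(x <= lo and hi <= y for (x, y) in p_rows):
        return "ok"
    # Otherwise, fall back to evaluating the cases of the new tuple directly.
    return "ko" if any(lo <= z <= hi for z in q_vals) else "ok"


# Assumed data: integrity is already violated in D by the case (1, 5, 3).
p_rows: Set[Interval] = {(1, 5)}
q_vals: Set[int] = {3}
new: Interval = (2, 4)

before = violated_cases(p_rows, q_vals)
after = violated_cases(p_rows | {new}, q_vals)
newly_violated = after - before          # cases satisfied in D, violated in D^U
verdict = containment_shortcut(p_rows, q_vals, new)
```

Under these assumptions, the shortcut outputs $ok$ although a case that was satisfied before the update is violated afterwards, which is exactly what a case-based method must never do.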
6. Compositionality
As postulated in section 4, the semantics of integrity is compositional, i.e., $D(IC) = sat$ if and only if $D(I) = sat$, for each $I \in IC$.
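Compositional semantics and a compositional brute-force evaluation with early exit can be sketched as follows. This is a minimal illustration, assuming that a database is a frozen set of ground facts and that a constraint is a Boolean function on databases; the names `integrity`, `updated` and `brute_force`, as well as the two sample constraints, are hypothetical.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

Fact = Tuple[str, int]                       # (predicate, argument)
Database = FrozenSet[Fact]
Constraint = Callable[[Database], bool]      # True stands for sat, False for vio


def integrity(db: Database, ic: Iterable[Constraint]) -> str:
    """D(IC) = sat iff D(I) = sat for each I in IC."""
    return "sat" if all(constraint(db) for constraint in ic) else "vio"


def updated(db: Database, inserts: Iterable[Fact] = (), deletes: Iterable[Fact] = ()) -> Database:
    """The updated database D^U."""
    return (db - frozenset(deletes)) | frozenset(inserts)


def brute_force(db: Database, ic: Iterable[Constraint],
                inserts: Iterable[Fact] = (), deletes: Iterable[Fact] = ()) -> str:
    """Plainly evaluate each constraint in D^U, stopping at the first violation."""
    new_db = updated(db, inserts, deletes)
    for constraint in ic:
        if not constraint(new_db):
            return "ko"
    return "ok"


def no_shared_value(db: Database) -> bool:
    """Sample denial  <- p(x), q(x)."""
    p_vals = {f[1] for f in db if f[0] == "p"}
    q_vals = {f[1] for f in db if f[0] == "q"}
    return not (p_vals & q_vals)


def no_negative_r(db: Database) -> bool:
    """Sample denial  <- r(x), x < 0."""
    return all(f[1] >= 0 for f in db if f[0] == "r")


ic = [no_shared_value, no_negative_r]
db: Database = frozenset({("p", 1), ("q", 2)})

state = integrity(db, ic)
rejected = brute_force(db, ic, inserts=[("q", 1)])
accepted = brute_force(db, ic, inserts=[("r", 5)])
```

The order in which the loop visits the constraints affects only performance, not the output.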
6.2 Compositional inconsistency tolerance
Although, as seen in example 2, not each compositional method $\mathcal{M}$ is inconsistency-tolerant, a particular kind of inconsistency tolerance can be achieved by $\mathcal{M}^{c}$, as defined below.
DEFINITION 4. A method $\mathcal{M}$ is called compositionally inconsistency-tolerant if $\mathcal{M}^{c}$ is case-based.
It is easy to show that compositional inconsistency tolerance is a weak form of case-based inconsistency tolerance.
THEOREM 2. If $\mathcal{M}$ is case-based, then $\mathcal{M}$ is compositionally inconsistency-tolerant.
The converse of theorem 2 does not hold. A counter-example is the non-compositional method in (7). In (11), that method has been observed to be compositionally inconsistency-tolerant but not inconsistency-tolerant. However, we have the following equivalence.
THEOREM 3. $\mathcal{M}$ is case-based if and only if $\mathcal{M}$ is compositional and compositionally inconsistency-tolerant.
Proof. Since $\mathcal{M}$ is compositional, $\mathcal{M}(D, IC, U) = ok$ entails $\mathcal{M}^{c}(D, IC, U) = ok$. Thus the if-half of theorem 3 follows, since $\mathcal{M}^{c}$ is case-based. The only-if-half follows from theorems 1 and 2.
Neither of the two conditions in theorem 3, which together entail inconsistency tolerance, implies the other. Example 2 shows that there are compositional methods that are not compositionally inconsistency-tolerant: the method of (15) is not compositionally inconsistency-tolerant. (To see this, recall that, for each method $\mathcal{M}$ and each triple $(D, IC, U)$ such that $IC$ is unitary, as in example 2, $\mathcal{M}(D, IC, U) = \mathcal{M}^{c}(D, IC, U)$.) Example 3 shows that not each compositionally inconsistency-tolerant method is compositional.
EXAMPLE 3. Let $D = \{q(a)\}$, $IC = \{\leftarrow q(x),\ \leftarrow q(a), r(x)\}$ and $U = insert\ r(b)$. As already noted, the method of (7) is compositionally [...]
7. Relevance
The easiest and often most effective way to speed up integrity checking is to take a single step beyond a brute-force evaluation. That step consists in distinguishing between constraints that are relevant, i.e., potentially violated by a given update, and those that are not. For many updates, this step makes it possible to ignore most or even all integrity constraints, if they are irrelevant. In fact, most methods focus on relevant constraints. Perhaps the earliest reference to emphasize the importance of relevance for integrity checking is (26). The most common criterion for identifying a constraint to be relevant is based on the occurrence of a literal in the constraint which matches one of the updated tuples. (See Appendix A or (22; 20) for more precise definitions; note, however, that our results are independent of such details.) Another criterion (used, e.g., in (15; 7)) is that a constraint is subsumed or contained by another one and therefore can be ignored.
For $\mathcal{M}$ and $(D, IC, U)$, let $Rel_{\mathcal{M}}(D, IC, U)$ denote the set of relevant constraints as determined by $\mathcal{M}$. If $\mathcal{M}$ is void of a suitable concept of relevance, $Rel_{\mathcal{M}}(D, IC, U)$ can simply be set equal to $IC$.
EXAMPLE 4. For virtually any method $\mathcal{M}$, and $D$, $I$, $I'$ and $U$ as in example 1, $I'$ is relevant for insertions of tuples into $q$. Thus, $Rel_{\mathcal{M}}(D, \{I, I'\}, U) = \{I'\}$. $I'$ also is relevant for deletions of tuples in $p$. Clearly, $I$ is not relevant for $U$, nor for any deletion. $I$ only is relevant for the insertion of tuples into $p$. (See Appendix A for more details.)
Usually, $Rel_{\mathcal{M}}(D, IC, U)$ is determined without accessing $D$. Otherwise, the efficiency gained by focusing on relevant constraints may be compromised. If $D$ is not relational but deductive and $IC$ involves view predicates, then also the view definitions of $D$ may be accessed, but never its so-called extensional part. So, the cost of determining $Rel_{\mathcal{M}}(D, IC, U)$ is negligible, as opposed to evaluating it. Hence, $Rel_{\mathcal{M}}(D, IC, U)$ provides a valuable benchmark for the efficiency of $\mathcal{M}$: the smaller this set is, on average, the better $\mathcal{M}$ is. However, such considerations of efficiency are only of secondary interest in this paper, since its results are independent thereof.
7.1 Relevance-based methods
The following definition captures a class of methods, the essence of which is to focus on relevant constraints.
DEFINITION 5. $\mathcal{M}$ is called relevance-based if, for each triple $(D, IC, U)$, the following holds.
$\mathcal{M}(D, IC, U) = ok \Rightarrow \mathcal{M}_{bf}(D, Rel_{\mathcal{M}}(D, IC, U), U) = ok$ (4)
Note that condition (4) is defined without the total integrity premise. (If, additionally, $D(IC) = sat$ were required, each method would be relevance-based.) Further, def. 5 does not insist that $\mathcal{M}$ should evaluate $Rel_{\mathcal{M}}(D, IC, U)$ brute-force; it just says that the results are the same. On the other hand, each $\mathcal{M}$ can be restricted, so as to become a relevance-based method, by defining $\mathcal{M}_{rel}(D, IC, U) := \mathcal{M}_{bf}(D, Rel_{\mathcal{M}}(D, IC, U), U)$, which means that $\mathcal{M}_{rel}$ evaluates $Rel_{\mathcal{M}}(D, IC, U)$ brute-force.
In general, relevance-based methods are not inconsistency-tolerant: in example 2, the method of (15) wrongly identifies the only constraint $I$ as not relevant, i.e., its set of relevant constraints for $(D, \{I\}, U)$ is $\emptyset$. Hence, the relevance-based restriction of that method outputs $ok$ for $(D, \{I\}, U)$, just as the method itself does, and thus is not case-based. However, we have the following result. To state it, let the set $\{I \in IC \mid \text{there is } C \in SatCas(D, \{I\}) \cap VioCas(D^U, \{I\})\}$ be abbreviated by $SatVio(D, IC, U)$.
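Restricting a brute-force evaluation to the relevant constraints can be sketched as follows, using the two constraints of example 1. This is only an illustration under simplifying assumptions: relevance is approximated at the level of predicate names (the fields `on_insert` and `on_delete` are hypothetical annotations, not the matching criterion of Appendix A), and the relevant set is computed without accessing the database.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Tuple

Fact = Tuple[str, str, str]                  # (predicate, first column, second column)
Database = FrozenSet[Fact]


@dataclass(frozen=True)
class Constraint:
    name: str
    on_insert: FrozenSet[str]                # predicates whose insertions may violate it
    on_delete: FrozenSet[str]                # predicates whose deletions may violate it
    holds: Callable[[Database], bool]


def pk_holds(db: Database) -> bool:
    """I: no two p-rows share the first column with different second columns."""
    seen = {}
    for pred, key, value in db:
        if pred == "p" and seen.setdefault(key, value) != value:
            return False
    return True


def fk_holds(db: Database) -> bool:
    """I': each q(x, y) has some p(y, z)."""
    return all(any(g[0] == "p" and g[1] == f[2] for g in db) for f in db if f[0] == "q")


def relevant(ic: Iterable[Constraint], inserts: Iterable[Fact],
             deletes: Iterable[Fact]) -> List[Constraint]:
    """Relevant constraints, determined from the update alone."""
    ins_preds = {f[0] for f in inserts}
    del_preds = {f[0] for f in deletes}
    return [c for c in ic if c.on_insert & ins_preds or c.on_delete & del_preds]


def check_relevant(db: Database, ic: Iterable[Constraint],
                   inserts: Iterable[Fact] = (), deletes: Iterable[Fact] = ()) -> str:
    """Brute-force evaluation in D^U, restricted to the relevant constraints."""
    inserts, deletes = list(inserts), list(deletes)
    new_db = (db - frozenset(deletes)) | frozenset(inserts)
    return "ok" if all(c.holds(new_db) for c in relevant(ic, inserts, deletes)) else "ko"


pk = Constraint("pk", frozenset({"p"}), frozenset(), pk_holds)
fk = Constraint("fk", frozenset({"q"}), frozenset({"p"}), fk_holds)
ic = [pk, fk]
db: Database = frozenset({("p", "b", "b"), ("p", "b", "c")})   # key constraint violated

names = [c.name for c in relevant(ic, [("q", "a", "b")], [])]
verdict_ab = check_relevant(db, ic, inserts=[("q", "a", "b")])
verdict_ad = check_relevant(db, ic, inserts=[("q", "a", "d")])
```

For the insertion into $q$, only the foreign key constraint is evaluated, so the extant violation of the key constraint does not affect the output.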
THEOREM 4. $\mathcal{M}$ is case-based if $\mathcal{M}$ is relevance-based and, for each $(D, IC, U)$, the following holds:
$Rel_{\mathcal{M}}(D, IC, U) \supseteq SatVio(D, IC, U)$ (5)
Proof. Let $\mathcal{M}$ be relevance-based and, for each $(D, IC, U)$, let (5) be satisfied. Further, let $\mathcal{M}(D, IC, U) = ok$, for some triple $(D, IC, U)$, and $C \in SatCas(D, IC)$. We have to show that $D^U(C) = sat$. Suppose that $D^U(C) = vio$. Then, there is an $I \in IC$ of which $C$ is a case, such that $D^U(I) = vio$. Moreover, since $D(C) = sat$, $C \in SatCas(D, \{I\})$ holds. Hence, by (5), it follows that $I$