Satisfiability of Integrity Constraints: Reflections on a Neglected Problem
Rainer Manthey
ECRC, Arabellastr. 17, 8000 München 81
[email protected]
1 Introduction

Since the relational model of data emerged in the early seventies, a considerable amount of work has been devoted to monitoring database integrity. The majority of papers addressing this topic deal with static (or state) constraints: axioms defining properties of a database that are expected to remain invariant under database updates. The main problem discussed is how to efficiently check whether all static constraints are satisfied in a given database state, or whether they remain satisfied after an updating transaction, respectively.

However, a fundamental prerequisite for any kind of integrity checking is hardly ever mentioned in any paper: it only makes sense to check whether constraints are satisfied in a particular state if they are satisfiable in any state at all. Most authors either tacitly assume satisfiability or completely neglect the problem, although inconsistencies among constraints are neither unlikely to occur nor easy to detect. In fact, determining whether a given set of integrity constraints is satisfiable or not is an undecidable problem, as it amounts to deciding consistency of a set of logical axioms.

One might be tempted to exclude those constraints for which satisfiability cannot be decided. However, decidable classes identified by logicians are defined according to syntactical criteria which are far too strict, as they exclude many semantically reasonable constraints whose usefulness can be well motivated by practical examples. Because of this, people often tend to conclude that nothing can be done - unsatisfiability has to be accepted as an inevitable danger that has to be expected according to some obscure theoretical results, but will hopefully never be encountered in practice.

However, there is no need to be discouraged or even fatalistic about undecidable problems! The fact that satisfiability is undecidable means that in certain cases the checking process will not terminate, whatever algorithm has been chosen.
In the vast majority of cases, however, satisfiability can be decided! We will discuss later in this paper how one can characterize those unfavorable cases where satisfiability checking might actually run forever, and how one can deal with this kind of situation in practice. For decidable cases, the problem of complexity arises: very often the constraints under consideration belong to a class of formulas for which deciding satisfiability is NP-hard. But again such a negative result does not disqualify the respective class of formulas as a whole: there are sets of formulas for which deciding satisfiability requires time which is exponential with respect to the size of the set, but in the vast majority of cases this task can well be performed with reasonable (i.e., polynomial) effort. Thus, we claim that theoretical limitations like undecidability or NP-hardness should not be considered as principal obstacles, but merely as a kind of warning reminding us of the existence of certain intractable instances of the problem.

In this paper we would first of all like to motivate the necessity of providing means for satisfiability checking in a database management system supporting any reasonably general form of integrity constraints. We will show how easy it is to end up with unsatisfiable constraints while designing an application, e.g. by simply forgetting certain exceptional cases. Subsequently, we will outline a practical approach to satisfiability checking that we regard as particularly well suited to a database context. This approach is based on a model generation paradigm, which has been successfully applied in the area of automated theorem proving. Automatically generating models of a set of constraints can be viewed as constructing example databases which satisfy the constraints under consideration. Automated model generation can be very conveniently implemented in a logic programming language like PROLOG and can be applied equally well to relational and to deductive databases.

As mentioned above, little work has been done on satisfiability checking in the database area up till now. This is the reason why the list of references at the end of this paper will look rather limited to some readers. The fact that the author of this paper is involved in most of the articles cited should also not be taken as a sign of vanity, but as reflecting the current situation as far as literature is concerned.
A first proposal for applying a particular theorem proving method for checking consistency of constraints is due to Kung [Ku85]. In [BM86] the problem was discussed for the first time at a major international conference, the particular importance of finite satisfiability was pointed out, and several different theorem proving paradigms were analysed with respect to their suitability in the database context. As a result of this analysis, we developed our own approach to consistency checking, extending the classical tableaux method which Kung had used in his proposal. This approach has been presented to the theorem proving community on the national [MB87] as well as on the international level [MB88]. Part of [BDM88] provided a brief introduction to a particular implementation of the model generation approach making use of integrity checking techniques.

However, a serious discussion of the satisfiability problem in the database context is still missing. The present paper is intended as a first step towards such a presentation. The reader should not expect to find any previously unpublished results, new or improved methods, or the like in this paper. Thus the paper will not be innovative! The emphasis in the forthcoming chapters will rather be on discussing the relevance and commenting on the consequences of those results which have been previously published.
2 Some Definitions and Conventions Ahead

Before motivating the usefulness of satisfiability checking in databases, let us briefly introduce some notational conventions and recall some notions from logic. Throughout this paper, we will represent integrity constraints as closed formulas of a function-free first-order calculus. A formula is closed if all its variables are bound by quantifiers. The only restriction we impose on constraints is that they be range-restricted (or safe). A very convenient way of guaranteeing that every formula is safe is by using restricted quantifiers. A closed formula is restricted if all its quantified subformulas have one of the following forms:
$\forall x_1, \ldots, x_n : \neg R$
$\forall x_1, \ldots, x_n : R \Rightarrow Q$
$\exists x_1, \ldots, x_n : R$
$\exists x_1, \ldots, x_n : R \wedge Q$
where $R$ is a (conjunction of) positive literal(s) containing each of the $x_i$ and $Q$ is an arbitrary formula. In the following we assume that all constraints are closed formulas with restricted quantifiers. There are two alternative ways of defining satisfiability of a set S of logical formulas, a model-theoretic and a proof-theoretic one.

- S is satisfiable if it has a model, i.e., an interpretation in which each of the formulas is true.
- S is satisfiable if it is not possible to derive any contradictory formulas $F$ and $\neg F$ from S.
Both definitions are equivalent, but they give rise to rather different kinds of algorithms for satisfiability checking: "optimistic" ones trying to construct models, and "pessimistic" ones trying to derive contradictions. It should be noted that database states satisfying all integrity constraints represent models of the constraint set. Such states are often called consistent states in database terminology. In logic, consistency is a synonym for satisfiability (often used in combination with the proof-theoretic definition). When speaking of 'consistency checking', database people usually mean 'checking whether a given state is consistent', i.e., checking constraint satisfaction. Be aware that this article is concerned with constraint satisfiability, which is a necessary prerequisite of satisfaction. Nevertheless, speaking of consistent states is logically correct, because the union of a satisfiable set of constraints and of a set of facts (stored or derivable) satisfying these constraints is a consistent set of formulas. The problem we are concerned with in this article is to determine consistency of the constraint set alone, independent of any particular set of facts. These remarks are necessary because the ambiguous use of 'consistency' in connection with constraints may be very puzzling and even misleading.
3 Unsatisfiable Constraints: Are They Likely to Occur?

In the past, we were rather often confronted with the opinion that constraint satisfiability is an unimportant problem, as it will hardly ever happen in practice that a constraint set is unsatisfiable. Another frequent argument is that integrity constraints have been around for such a long time by now, but nobody has ever complained about inconsistencies. So why worry about satisfiability at all?

Let us first address the second argument: no trouble with unsatisfiability in practice up till now! When looking at commercially available database systems, one can observe that integrity constraints are either not offered at all, or that only very restricted forms of constraints are available: functional dependencies, keys, typings, inclusion dependencies, join dependencies, and the like. When calling this kind of constraints 'very restricted', we don't doubt their usefulness and practical relevance. However, compared with the constraint concept we are considering here - closed first-order formulas in full generality - the above-mentioned classes of dependencies constitute rather small particular subcases. As far as satisfiability is concerned, dependencies are uncritical. They exhibit a very similar structure if formulated as logical formulas, all being expressible as universally quantified implicational formulas of the form $\forall x_1, \ldots, x_n : R \Rightarrow Q$.
Any set of formulas of this kind is satisfiable. Every dependency is satisfied at least in the "empty" model - where every atomic ground formula evaluates to false. If we are not content with such a trivial, but in principle fully valid solution, we can easily obtain nontrivial models by choosing an arbitrary set of facts and "completing" it by adding those facts required by tuple-generating dependencies and identifying constants according to equality-generating dependencies [Ull88]. As an example, take the three following dependencies:

1. an inclusion dependency: $\forall X : \mathit{manager}(X) \Rightarrow \mathit{employee}(X)$
2. a functional dependency: $\forall X_1, X_2, Y : \mathit{leads}(X_1, Y) \wedge \mathit{leads}(X_2, Y) \Rightarrow X_1 = X_2$
3. a type constraint: $\forall X, Y : \mathit{leads}(X, Y) \Rightarrow \mathit{manager}(X) \wedge \mathit{department}(Y)$
Let us start from an initial set of facts {leads(e1,d1), leads(e2,d1)}. The equality-generating functional dependency requires that e1 and e2 be identical in any model. Let us take e1 as a representative. In order to have the third constraint satisfied, the model has to contain facts 'manager(e1)' and 'department(d1)' as well. Finally, in order to satisfy the inclusion dependency we need 'employee(e1)'. The resulting set of facts {leads(e1,d1), manager(e1), department(d1), employee(e1)} is a model of the three dependencies. In a similar way, we may obtain non-trivial models for any kind of dependencies starting from some "guessed" initial facts. However, as there is no existential constraint forcing us to have any facts at all, the empty model suffices for demonstrating satisfiability.
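The completion step just described can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function name `chase` and the encoding of facts as (predicate, arguments) pairs are hypothetical and not taken from the paper, and the three example dependencies are hard-coded rather than derived from formulas.

```python
def chase(facts):
    """Complete a set of facts under the three example dependencies.

    Facts are pairs (predicate, args), e.g. ("leads", ("e1", "d1")).
    This encoding is an assumption made for illustration only.
    """
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        # Functional dependency: leads(X1, Y) and leads(X2, Y) imply X1 = X2.
        # The first leader seen (in sorted order) becomes the representative.
        leader_of, rename = {}, {}
        for pred, args in sorted(facts):
            if pred == "leads":
                x, y = args
                rep = leader_of.setdefault(y, x)
                if rep != x:
                    rename[x] = rep
        if rename:
            facts = {(p, tuple(rename.get(a, a) for a in args))
                     for p, args in facts}
            changed = True
            continue
        # Tuple-generating dependencies: type constraint and inclusion dependency.
        new = set()
        for pred, args in facts:
            if pred == "leads":
                new.add(("manager", (args[0],)))
                new.add(("department", (args[1],)))
            if pred == "manager":
                new.add(("employee", args))
        if not new <= facts:
            facts |= new
            changed = True
    return facts


initial = {("leads", ("e1", "d1")), ("leads", ("e2", "d1"))}
for fact in sorted(chase(initial)):
    print(fact)
```

Starting from the empty set of facts, the same function returns the empty set, which corresponds to the trivial "empty" model.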
In order to come up with an unsatisfiable constraint set, at least one such (positive) existential condition and at least one negative condition, completely excluding a certain constellation for every database state, are required. Neither of them can be expressed by means of a dependency. Let us now consider another (more elaborate) example containing both an existential and a negative constraint.

1. $\forall X, Y : \mathit{leads}(X, Y) \Rightarrow \mathit{member}(X, Y)$
2. $\forall X, Y, Z : \mathit{member}(X, Z) \wedge \mathit{leads}(Y, Z) \Rightarrow \mathit{works.for}(X, Y)$
3. $\forall X : \mathit{employee}(X) \Rightarrow (\exists Y : \mathit{department}(Y) \wedge \mathit{member}(X, Y))$
4. $\forall X : \mathit{department}(X) \Rightarrow (\exists Y : \mathit{employee}(Y) \wedge \mathit{leads}(Y, X))$
5. $\exists X : \mathit{employee}(X)$
6. $\neg \exists X : \mathit{works.for}(X, X)$
These constraints have a perfectly natural semantics, describing some of the basic laws to be found in any enterprise. When reflecting a while about the compatibility of the six constraints, one will probably conclude that there shouldn't be any problem concerning satisfiability. Nevertheless, the six above constraints can be shown to be unsatisfiable, which might come as a surprise to the reader.

Every model of our example constraints at least has to contain an employee - say e1 - in order to satisfy constraint 5. Furthermore, there must be a department - say d1 - to which e1 belongs, in order to satisfy constraint 3. In addition there has to be another employee - say e2 - leading d1, in order to satisfy constraint 4. According to constraint 1, e2 must also be a member of d1, as he leads this department. Now, in order to have constraint 2 satisfied, e2 has to work for himself, as he is both member and leader of the same department. Up till now we have collected facts that necessarily have to be in any model, if one exists at all. However, the negative constraint 6 excludes a situation where anybody works for himself. Obviously, in our constraint set we have five constraints categorically requiring that e2 works for himself, and one constraint categorically excluding it: a prototypical case of incompatibility or unsatisfiability!

What can be the reason for such an inconsistency so easily escaping a reader's attention? An exceptional case has been overlooked when designing constraint 2: every member of a department works for the boss of the department, except the boss himself. Thus constraint 2 should be:

$\forall X, Y, Z : \mathit{member}(X, Z) \wedge \mathit{leads}(Y, Z) \wedge X \neq Y \Rightarrow \mathit{works.for}(X, Y)$
We think that this very small, but perfectly realistic example illustrates in a rather impressive way how easy it is to introduce contradictory constraints even in a setting which seems to be ridiculously simple. In case of several hundred constraints, which can easily be expected in a real-life application, complexity will increase even more because of the sheer size of the problem. However, we would like to point out that the size of the constraint set is neither a necessary nor the only reason why satisfiability checking appears to be advisable. Experience from the theorem proving domain clearly shows that very big sets of formulas are often very easily checked for consistency, whereas extremely small examples may be nearly intractable.
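To make the difference between the original and the corrected constraint 2 concrete, the following Python sketch checks constraint satisfaction (not satisfiability) for a given database state. It is a minimal sketch under stated assumptions: the function name `violated`, the dictionary encoding of relations, and the spelling `works_for` for works.for are hypothetical and chosen only for illustration.

```python
def violated(db, corrected=False):
    """Return the numbers of the six example constraints that db violates.

    db maps relation names to sets of constants or pairs (an assumed encoding).
    With corrected=True, constraint 2 carries the extra condition X != Y.
    """
    employee = db.get("employee", set())
    department = db.get("department", set())
    member = db.get("member", set())
    leads = db.get("leads", set())
    works_for = db.get("works_for", set())
    bad = []
    # 1. Whoever leads a department is a member of it.
    if not all(pair in member for pair in leads):
        bad.append(1)
    # 2. Members of a department work for its leader.
    if not all((x, y) in works_for
               for (x, z) in member for (y, z2) in leads
               if z == z2 and not (corrected and x == y)):
        bad.append(2)
    # 3. Every employee is a member of some department.
    if not all(any((x, d) in member for d in department) for x in employee):
        bad.append(3)
    # 4. Every department is led by some employee.
    if not all(any((e, d) in leads for e in employee) for d in department):
        bad.append(4)
    # 5. There is at least one employee.
    if not employee:
        bad.append(5)
    # 6. Nobody works for himself.
    if any(x == y for (x, y) in works_for):
        bad.append(6)
    return bad


state = {
    "employee": {"e1"},
    "department": {"d1"},
    "member": {("e1", "d1")},
    "leads": {("e1", "d1")},
}
print(violated(state))                  # checked against the original constraint 2
print(violated(state, corrected=True))  # checked against the corrected constraint 2
```

In this state a single employee leads his own department. Under the original constraint 2 the state violates that constraint, and adding the fact that e1 works for himself would only shift the violation to constraint 6; under the corrected version the state is a model.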
4 Satisfiability Checking by Model Generation

4.1 Why Choose Model Generation?

In principle, any kind of theorem prover is a potential candidate for serving as a satisfiability checking component inside a DBMS. Why do we favor a model-generation approach? We mentioned earlier that model generation can be characterized as an "optimistic" approach to satisfiability checking. If the set of formulas under consideration is in fact satisfiable, it is sufficient to construct a single model for demonstrating this property. In case all possible ways that might lead to a model have to be discarded due to a contradiction, unsatisfiability is shown. Thus, a single solution in case of satisfiability contrasts with an exhaustive search in case of unsatisfiability if a model generation approach is used.

As an alternative one could apply a theorem prover trying to derive contradictory logical consequences. Here having found a single contradiction is sufficient for concluding unsatisfiability. In the satisfiable case, however, an approach based on derivation of logical consequences has to exhaust all possible ways towards a contradiction. Compared to the model generation approach the situation is just the other way round: a single solution suffices in case of unsatisfiability, exhaustive search is required in case of satisfiability.

If proving theorems is the purpose, it might be advisable to apply a procedure which will quickly lead to an answer in case of unsatisfiability. This is because in a refutation proof one tries to show that the axioms and the negation of the theorem are unsatisfiable. In a database context, preferences should be different. As a particular constraint set has most likely been designed correctly, i.e., is likely to be satisfiable, a procedure able to quickly demonstrate satisfiability should be preferred. This is a first reason to prefer model generation for our purposes.
Another, more pragmatic reason is that models and databases are very close to each other, as mentioned above. A model is a small "artificial" database state, and we conjecture that it might be much easier for the constraint designer to deal with an algorithm that constructs example databases than with one deriving logical consequences. Furthermore, additional properties of the designed constraints (other than satisfiability) might be easily checked or discovered when inspecting the models constructed by such an algorithm. In particular, a model generation algorithm might serve as a core component of a more sophisticated design tool helping a database designer to better understand the implications of his design.
4.2 Principles of Automated Model Generation

The key idea of our approach to model generation is to view constraints as a kind of production rules. Implications are exploited for "driving" the generation of facts, existential quantifiers serve as a means for the generation of new individuals. Instead of presenting a detailed description of our algorithm - which has been given in [MB88] and [BDM88] - we will try to give the reader some intuition about the approach. For doing so, we will first of all exhibit the basic principles by discussing how model generation works for the six example constraints introduced above.
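For this purpose, let us rephrase the constraints as production rules:

1. $\mathit{leads}(X, Y) \rightarrow \mathit{member}(X, Y)$
2. $\mathit{member}(X, Z) \wedge \mathit{leads}(Y, Z) \rightarrow \mathit{works.for}(X, Y)$
3. $\mathit{employee}(X) \rightarrow \mathit{department}(d(X)) \wedge \mathit{member}(X, d(X))$
4. $\mathit{department}(X) \rightarrow \mathit{employee}(b(X)) \wedge \mathit{leads}(b(X), X)$
5. $\mathit{true} \rightarrow \mathit{employee}(e)$
6. $\mathit{works.for}(X, X) \rightarrow \mathit{false}$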
Observe that we have replaced existentially quantified variables by functional terms, taking as arguments all those universal variables which occur free in the respective existential formula. This process is called Skolemization and is a standard means of expressing existential quantification in clausal logic used in a theorem proving context. Furthermore, we write all constraints in an implicational manner. This is possible because a positive formula $P$ is logically equivalent to an implication $\mathit{true} \rightarrow P$, and a negative formula $\neg P$ is equivalent to $P \rightarrow \mathit{false}$. Thus we have expressed all constraints in a uniform setting, which permits us to generate models by executing the following algorithm:

A. Start with an empty database, where every formula is false!
B. Choose a production rule $A \rightarrow B$ such that the condition part $A$ evaluates to true and $B$ evaluates to false over the current database! If no such rule can be found then stop and report satisfiability, else if $B$ is a single fact then add $B$ to the database, else if $B$ is a conjunction of facts then add each of them.
C. If false is in the database then stop and report unsatisfiability, else go to B.

It is crucial for the completeness of the algorithm that the choice of rules to be applied in step B is made according to a "fair" strategy, where every rule which is applicable in principle is eventually applied by the algorithm. A very simple such strategy is the level-saturation strategy, which simultaneously adds all facts producible according to step B to the example database under construction. Fairness of strategy and level saturation are well-established terms in theorem proving. Applying the algorithm to our six example production rules using a level-saturation strategy will result in the following sequence of levels (where true represents the empty database):
true
employee(e)
department(d(e)) member(e,d(e))
employee(b(d(e))) leads(b(d(e)),d(e))
member(b(d(e)),d(e))
works.for(b(d(e)),b(d(e)))
false

In case we had chosen the corrected version of constraint 2, i.e.,

2. $\mathit{member}(X, Z) \wedge \mathit{leads}(Y, Z) \wedge X \neq Y \rightarrow \mathit{works.for}(X, Y)$
the process would have stopped after the fourth iteration, not producing 'works.for(b(d(e)), b(d(e)))' and consequently not running into false. In this case, the six facts generated by the algorithm would represent a model of the six constraints. This model would not be a minimal one, as employee e might play the role of department leader as well, and thus employee b(d(e)) could be saved. The model would be minimal up to possible identification of individuals, however. If we had chosen an "unfair" strategy, e.g. a Prolog-like strategy always choosing the first applicable rule in textual order, generation would not terminate due to an oscillation between rules 3 and 4.

It is very easy and convenient to implement such a generation process in Prolog as shown in [MB88] or [BDM88]. Each model will usually be small enough to fit into Prolog's main memory database. Query evaluation over such a database is implemented by ordinary Prolog goal evaluation (replacing $\wedge$ by ','). In case of a disjunctive constraint such as, e.g.,

$\forall X : \mathit{employee}(X) \Rightarrow \mathit{male}(X) \vee \mathit{female}(X)$

the model generation process will be split into (in this case) two alternative subprocesses, one where the respective 'male'-fact is added to the database, and one where a 'female'-fact is added instead. The two branches of the resulting search tree are explored one after the other, driven by Prolog's backtracking facility. Rule selection, however, should not be left to the Prolog interpreter but has to be meta-programmed in order to avoid looping due to unfairness (see above!). If applied interactively, such a model generation algorithm can, e.g., be exploited in the context of a graphically supported design tool, where the growing (and shrinking) of models is visualized during the generation process.
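Steps A to C with a level-saturation strategy can be sketched as follows for the six original production rules. This is an illustrative Python sketch, not the Prolog implementation of [MB88] or [BDM88]: the names `tuples`, `consequences` and `generate`, the encoding of facts as tuples, the spelling `works_for` for works.for, and the bound `max_levels` are all assumptions introduced here. Skolem terms such as d(X) and b(X) are represented as plain strings.

```python
FALSE = ("false",)


def tuples(db, pred):
    """Argument tuples of all facts in db with the given predicate."""
    return [fact[1:] for fact in db if fact[0] == pred]


def consequences(db):
    """Heads of all rule instances whose condition part holds in db."""
    out = {("employee", "e")}                      # rule 5: true -> employee(e)
    for x, y in tuples(db, "leads"):               # rule 1
        out.add(("member", x, y))
    for x, z in tuples(db, "member"):              # rule 2 (original version)
        for y, z2 in tuples(db, "leads"):
            if z == z2:
                out.add(("works_for", x, y))
    for (x,) in tuples(db, "employee"):            # rule 3, Skolem term d(X)
        out.add(("department", f"d({x})"))
        out.add(("member", x, f"d({x})"))
    for (x,) in tuples(db, "department"):          # rule 4, Skolem term b(X)
        out.add(("employee", f"b({x})"))
        out.add(("leads", f"b({x})", x))
    for x, y in tuples(db, "works_for"):           # rule 6
        if x == y:
            out.add(FALSE)
    return out


def generate(max_levels=20):
    """Level saturation: add all newly producible facts at once per level."""
    db, levels = set(), []                         # step A: empty database
    for _ in range(max_levels):
        new = consequences(db) - db                # step B: heads not yet in db
        if not new:
            return "satisfiable", levels
        db |= new
        levels.append(new)
        if FALSE in db:                            # step C
            return "unsatisfiable", levels
    return "unknown", levels                       # budget exhausted (assumption)


status, levels = generate()
print(status)
for number, level in enumerate(levels, start=1):
    print(number, sorted(level))
```

Because the sketch fires every applicable rule instance at each level, its levels contain a few more facts than the condensed listing in the text (for instance works_for(e, b(d(e))) and a further department for b(d(e))). The contradiction is reached through the same chain of facts.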
If constraints have been designed together with derivation rules, i.e., if a deductive database is concerned, the satisfiability problem is even more urgent, as contradictions may not only arise between constraints, but both constraints and rules have to be checked for satisfiability. The model generation process outlined in this section will be exactly the same. However, when adding a fact to the "model database" under construction, additional implicit facts will become derivable which have to be considered during query evaluation as well. This is most easily accomplished if derivation rules are directly expressed as Prolog rules.
5 How to Deal With Undecidability

In the introduction, we stated that satisfiability checking may run forever in certain cases, due to the undecidability of the problem. Let us describe in more detail how and when termination problems can arise. First, we can be sure to get an answer in finite time in case our set of constraints is unsatisfiable. Unsatisfiability is a semi-decidable property, i.e., termination is guaranteed in case the property holds, but not if it does not hold. This means that cases of non-termination have to be expected only if the constraint set is satisfiable. A second theoretical result helps in further narrowing down problematic cases: finite satisfiability, i.e., the property of a set of formulas to admit finite models, is semi-decidable as well. Thus only one problematic class of formulas remains, namely those whose models are all infinite. Such sets of formulas are called "axioms of infinity". Again it might happen very easily that such an "axiom of infinity" occurs in practice. Just consider the following five constraints:

1. $\forall X : \mathit{employee}(X) \Rightarrow (\exists Y : \mathit{works.for}(X, Y))$
2. $\neg \exists X : \mathit{works.for}(X, X)$
3. $\exists X : \mathit{employee}(X)$
4. $\forall X, Y, Z : \mathit{works.for}(X, Y) \wedge \mathit{works.for}(Y, Z) \Rightarrow \mathit{works.for}(X, Z)$
5. $\forall X, Y : \mathit{works.for}(X, Y) \Rightarrow \mathit{employee}(X) \wedge \mathit{employee}(Y)$
Although again looking perfectly reasonable at first glance, the five constraints can only be satisfied in a database containing an infinite hierarchy of employees, because once more an exceptional case has been forgotten: the top manager of the enterprise does not work for anybody else. Our algorithm for model generation will consequently generate one employee fact after the other, independent of the strategy chosen! There is no way to enhance this algorithm such that infinitely repeated patterns like

$\ldots \rightarrow \mathit{employee}(e_i) \rightarrow \mathit{works.for}(e_{i-1}, e_i) \rightarrow \mathit{employee}(e_{i+1}) \rightarrow \ldots$

are detected in all cases. This is what undecidability means in practice: even the cleverest algorithm for loop checking will not be complete! If using our model generation algorithm interactively, a constraint designer will however observe the ever-repeating pattern of the generation process in most cases, and will immediately interrupt generation once he is sufficiently sure that an "axiom of infinity" is hidden among the constraints he has designed. However, in theory ...!

There is one problem left that may cause an infinite generation even in cases where finite models do exist. When reformulating our six example constraints above, we have replaced every existentially quantified variable by a functional term. Doing so leads to the generation of a new term each time the respective production rule is recursively applied during model generation. It is only by checking whether a term already introduced before might be used for satisfying the respective existential formula that a finite model can be found in such cases. In [BM87] we have investigated the implications of such enhancements of our basic model generation procedure. The price which has to be paid for a systematic check for reusability of already existing terms is high. Therefore, in practice, one should try to do without such a check first. In many cases doing so might be sufficient - as was the case in our example above, where we didn't check whether employee e might serve as a leader of department d(e), but introduced a "new" employee b(d(e)) straightaway.
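The behaviour on the five constraints of this section can be observed with a bounded variant of level saturation. The Python sketch below is again only an illustration under stated assumptions: the names `consequences` and `generate`, the Skolem function name `boss`, the spelling `works_for` for works.for, and the generation budget `max_levels` are hypothetical. The budget plays the role of the designer who interrupts generation; it is not part of the algorithm described in the paper and does not decide anything.

```python
FALSE = ("false",)


def consequences(db):
    """Heads of all instances of the five rules whose condition holds in db."""
    out = {("employee", "e")}                          # constraint 3
    works_for = [fact[1:] for fact in db if fact[0] == "works_for"]
    for fact in db:
        if fact[0] == "employee":                      # constraint 1, Skolem term boss(X)
            out.add(("works_for", fact[1], f"boss({fact[1]})"))
    for x, y in works_for:
        if x == y:                                     # constraint 2
            out.add(FALSE)
        out.add(("employee", x))                       # constraint 5
        out.add(("employee", y))
        for y2, z in works_for:                        # constraint 4: transitivity
            if y == y2:
                out.add(("works_for", x, z))
    return out


def generate(max_levels):
    """Level saturation that gives up after max_levels levels."""
    db = set()
    for _ in range(max_levels):
        new = consequences(db) - db
        if not new:
            return "satisfiable", db
        db |= new
        if FALSE in db:
            return "unsatisfiable", db
    return "unknown", db                               # neither model nor contradiction found


for budget in (10, 20):
    status, db = generate(budget)
    employees = sorted(fact[1] for fact in db if fact[0] == "employee")
    print(budget, status, len(employees))
```

However large the budget is chosen, the sketch keeps introducing new employees and never reaches a fixpoint or a contradiction, so the answer "unknown" only reflects that the budget was exhausted.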
6 Conclusion

It was the purpose of this paper to make the reader aware of the fact that designing integrity constraints in a relational (or deductive) database might be a risky affair. As soon as constraints are more expressive than just dependencies, which we regard as absolutely necessary, satisfiability of constraints becomes a problem and has to be checked when designing constraints (and possibly rules). We have discussed theoretical and practical problems related to satisfiability checking and outlined a model generation approach on which a Prolog implementation is based. This approach has been successfully applied for solving problems in theorem proving in the past and seems to be particularly well suited to checking satisfiability of database constraints.
References:

[BDM88] F. Bry, H. Decker, R. Manthey: "A Uniform Approach to Constraint Satisfaction and Constraint Satisfiability in Deductive Databases", Proc. 1st Intern. Conf. on Extending Database Technology (EDBT), Venice, 1988, Springer LNCS Vol. 303, pp. 488-505
[BM86] F. Bry, R. Manthey: "Checking Consistency of Database Constraints: a Logical Basis", Proc. 12th Intern. Conf. on Very Large Data Bases (VLDB), Kyoto, 1986, pp. 13-20
[BM87] F. Bry, R. Manthey: "Proving Finite Satisfiability of Deductive Databases", Proc. 1st Workshop on Computer Science Logic (CSL), Karlsruhe, 1987, Springer LNCS Vol. 329, pp. 44-55
[MB87] R. Manthey, F. Bry: "A Hyperresolution-Based Proof Procedure and its Implementation in Prolog", Proc. 11th German Workshop on Artificial Intelligence (GWAI), Geseke, 1987, Springer IFB 152, pp. 221-230
[MB88] R. Manthey, F. Bry: "SATCHMO - A Theorem Prover Implemented in Prolog", Proc. 9th Int. Conf. on Automated Deduction (CADE), Chicago, 1988, Springer LNCS Vol. 310, pp. 415-434
[Ku85] C.H. Kung: "A Tableaux Approach for Consistency Checking", IFIP WG 8.1 Working Conf. on Theoretical and Formal Aspects of Information Systems, Sitges, April 1985, A. Sernadas [ed.], North Holland Publ.
[Ull88] J.D. Ullman: Principles of Database and Knowledge-base Systems, Vol. 1, Computer Science Press, 1988