Semantic Guidance for Saturation-Based Theorem Proving

Kahlil Hodgson and John Slaney
Australian National University, Canberra 0200, Australia
{Kahlil.Hodgson, John.Slaney}@anu.edu.au

1 Introduction

    Question: What semantically oriented strategy can direct a reasoning program in its choice of clauses to which to apply the inference rule(s) being employed: what properties can be used other than the current criterion of simply finding a clause containing a literal with the appropriately signed predicate?
        L. Wos [10] (problem #5)

In [10] Wos identifies "inadequate focus" as one of the primary obstacles to effective theorem proving. Saturation-based provers, which work by extending initial derivation fragments forward from the assumptions, typically suffer from the fact that most of these fragments are redundant at best and at worst incapable of extension to any proof. Yet standard rules of inference such as resolution or paramodulation apply equally to all derivation fragments, and standard heuristics such as short clause preference do not track the distinction between the useful consequences and the dross. In this paper we report one line of attack on the focus problem for saturation methods of first order theorem proving, by injecting semantic information into the heuristics used to order the possible inferences.

Preliminary work on this idea,¹ in collaboration with Lusk, McCune and others, resulted in the system Scott [5, 1, 7], which showed some modest efficiency gains relative to its parent Otter. However, the main technique used in that prover was model resolution; the work on false preference (see below) remained unsystematic and lacked a theoretical basis. The new generation of Scott rests on a new understanding of semantic guidance and shows remarkably stable behaviour over a wide range of problems. We present results on problems from the TPTP library, and performance under fair conditions in CASC, as compelling evidence that the effects exploited by our technique are real and useful.

The structure of the present paper is as follows. In §2 we explain the logic underpinning semantic guidance as used by Scott. In §3 we briefly describe the implementation, and in §4 we report investigations of its behaviour over a variety of problems, noting its results in CASC in comparison with other provers, and detailing the design suite used in the development of Scott.

2 Theory

2.1 Consistency

Suppose we had an oracle that answered questions of the form: is first-order formula F satisfiable? How could this be useful in proof search? For decompositional methods (tableaux and the like) the answer is obvious: much of the complexity of those methods arises from the proliferation of unprovable subgoals, which may simply be avoided if they are known in advance to be unprovable; in the jargon of proof theory, the oracle would make all rules invertible. For bottom-up reasoning methods, however, the answer is less obvious.

¹ Ours is not, of course, the only approach represented in the literature. Semantically based refinements of resolution go back at least to Slagle's early work [4]. Mention should also be made of Plaisted's semantic hyper-linking [3], which was developed independently of Scott and also uses finite model generation in first order proof search.


One idea is as follows: at any point during the search, let S be a consistent subset of the clauses derived so far, and c a particular clause inconsistent with S. Then there are proofs of the empty clause from S ∪ {c}, and c occurs in every such proof. Therefore if the next clause chosen as parent for inferences—the next "given clause" in the standard usage—is c, at least one proof (and possibly many) will be extended by it. Using the oracle, then, we could maintain a record of the maximal consistent subsets of the clauses so far derived, and choose the given clauses from the complements of these sets, thus guaranteeing with every given clause that some proof fragment is extended. Naturally, there is no way to know whether the proofs whose fragments are extended at a given point are the short ones: screening out proofs which contain irrelevant excursions is a different issue, which we do not claim to have addressed.

Other uses for an oracle may occur to one. For instance, where the clause space is divided at each point into the "usable" part and the set of support, the oracle may be used to cull many of the clauses in a two-phase operation after each round of inference. First, usable clauses may be deleted as long as the remainder is inconsistent with the set of support; then, if the remaining usable set is consistent, clauses may be removed from the set of support to leave a minimal set inconsistent with the usable clauses. Such a culling strategy would result in an extremely focussed search, though again there is no reason to expect the proofs focussed on to be particularly short ones.

Unfortunately, of course, there is no such oracle. However, that does not mean we cannot sometimes tell that a formula is satisfiable: for instance, if it happens to have a model in a domain of just three elements, it may be rather easy to find that model and so establish satisfiability. Of course, a search for finite models of this kind cannot always show satisfiable formulae to be satisfiable—if they have only infinite models, or even, in practice, if their finite models are all large—but we hope to show that it can usefully approximate a satisfiability oracle nonetheless.² It often gives the correct answer, and where it does not, at least it never gives false positives. In order to behave like an oracle—even a dirty one—it must return its result quickly (if not in unit time, at any rate in a time that is short relative to the theorem proving process), and so it should be given a finite bound (on domain size, on elapsed time, on the number of nodes expanded in its search tree, or something of the sort) in order to force completion.

The first idea outlined above survives the move from a perfect oracle to a dirty one which sometimes gives false negatives: the supposedly maximal consistent sets may become a little less than maximal, but they are still consistent and can still focus the search for the most part on proofs rather than on useless sequences of inferences. The second of the above methods does not survive: if the dirty oracle falsely claims inconsistency, formulae may be deleted incorrectly and completeness lost. Hence, for our work on Scott we concentrate on the first method only.
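To make the first of these ideas concrete, the following sketch (ours, not code from Scott) shows how a bounded satisfiability test could be used. The predicate satisfiable is an assumed stand-in for a finite model search; everything else is ordinary Python.

def near_maximal_consistent_subset(derived, satisfiable):
    """Greedily grow a consistent subset S of the derived clauses.

    satisfiable is the 'dirty oracle': it returns True only when it
    actually finds a model, so a False answer may be a false negative,
    and S is therefore only *near*-maximal.
    """
    s = []
    for c in derived:
        if satisfiable(s + [c]):
            s.append(c)
    return s

def given_clause_candidates(derived, satisfiable):
    """Clauses outside the near-maximal consistent set.  Every refutation
    of the (inconsistent) derived set must use at least one clause from
    this complement, so it is a sensible pool of given clauses."""
    s = near_maximal_consistent_subset(derived, satisfiable)
    return [c for c in derived if c not in s]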

2.2 Semantic Guidance

We recall the given clause algorithm (GCA) used by Scott and by most contemporary high-performance theorem provers.³ At its simplest, it reads as in Figure 1. Complications such as back subsumption and back demodulation have been omitted for clarity. There are four obvious points at which heuristics may apply to affect the efficiency of the search:

1. In the choice of initial partition (step 1). Any consistent UL may be chosen without sacrificing refutation completeness.
2. In the choice of given clause (step 2a).
3. In the choice of which rules of inference to use and with what restrictions on their application (step 2b).
4. In the choice of filters to remove unwanted consequences (step 2c).


² It is evident that many heuristics may be modelled as noisy oracles like this. We are not aware of any body of work on oracles with noise, though the idea would appear to be a useful and quite general abstraction.
³ See, for instance, page 5 of [2].


1. Partition the clause space into usable list UL and set of support SOS.
2. While the empty clause has not been derived and SOS is nonempty do
   (a) Select a given clause from SOS and move it to UL.
   (b) Generate clauses by applying the inference rules to UL, with the restriction that each generated clause must have the given clause as one of its parents.
   (c) Append the new clauses that pass the retention tests to SOS.

Fig. 1. The given clause algorithm

Scott's false preference strategy is mainly aimed at the choice of given clauses (though it also affects the initial partition and, because it effectively re-orders the set of support, it may change the effect of certain filters). As suggested above, our technique is to select given clauses almost always from a collection of co-NMCS (complements of near-maximal consistent sets). To identify these NMCS and witness their consistency, the prover uses models generated at need by a finite domain constraint solver, which thus functions as a "dirty oracle" yielding no false positives but an unknown proportion of false negatives. Where S is any consistent set, clearly every proof contains at least one formula from the complement of S. Hence, if at some stage the intersection of some co-NMCS with the set of support is small, it makes sense to take all the clauses in that intersection as the next few given clauses. The result is that every proof is extended either by one of those given clauses or by something already in the usable list. We call this the "semantic queue" strategy. The meaning of "small" for this purpose is something to be determined pragmatically; it seems from our observations that, roughly speaking, single-digit numbers are "small".

In order to enforce fairness, and because there is no easy way to know whether a given maximal consistent subset of the clause space is a good one to use for guidance—and also because there is no knowing in which cases it is a bad idea to trust the model generator to be an oracle—Scott does not rely exclusively on the semantic queue strategy. Where all the co-NMCS are "large", it lets the choice of given clause cycle through them. Within each co-NMCS, the clauses are ordered by weight and age as normal, so that the choice among them is fair. Moreover, because the cost of maintaining all NMCS would be prohibitive, and again to ensure complete fairness, it occasionally chooses a given clause from among those which are in all known NMCS.
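As an illustration of the selection policy just described, here is a small Python sketch. The threshold for "small" and the round-robin bookkeeping are our assumptions for the purpose of the example; they are not taken from Scott's source.

class SemanticQueueSelector:
    def __init__(self, small=9):
        self.small = small   # single-digit intersections count as "small"
        self.cursor = 0      # round-robin position over the co-NMCSs

    def select(self, sos, co_nmcs_list, pick_by_weight_age):
        """sos: set of support (a set of clauses).
        co_nmcs_list: complements of the near-maximal consistent sets.
        pick_by_weight_age: the usual syntactic choice within a pool."""
        pools = [co & sos for co in co_nmcs_list if co & sos]
        if not pools:
            # No semantic information discriminates: fall back to the
            # ordinary weight/age ordering over the whole set of support.
            return pick_by_weight_age(sos)
        smallest = min(pools, key=len)
        if len(smallest) <= self.small:
            # Semantic queue: every proof is extended either by one of
            # these few clauses or by something already in the usable list.
            return pick_by_weight_age(smallest)
        # All co-NMCSs are "large": cycle through them fairly, choosing
        # within each by weight and age as normal.
        self.cursor = (self.cursor + 1) % len(pools)
        return pick_by_weight_age(pools[self.cursor])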

2.3 The tradeoff

Evidently, there is a tradeoff between time spent in the semantic component, searching for models, and time spent in the syntactic component, searching for proofs. Investing time in modelling tends to improve the quality of semantic guidance and therefore to increase efficiency, but on the other hand it is a proof rather than a model which is the main goal. The tradeoff has three aspects.

First, as already noted, for a finite model generator to function as a dirty oracle, its search must be finitely bounded. The more generous this bound, the closer we expect the resultant NMCS to approximate an MCS, and hence the higher the quality of its guidance. Generosity costs time, however, so the bound must be set judiciously. The tradeoff is further complicated by the fact that some clauses may be hard to model (for example, because they contain many variables or have large term depth), so that a greater than usual investment in the model search may be worthwhile in some cases.
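The following wrapper illustrates the kind of bound involved. It is a sketch under assumptions: search_domain stands in for a call to a finite domain solver, and the particular limits are illustrative only.

import time

def bounded_model_search(clauses, search_domain, max_domain=4, time_budget=0.5):
    """Dirty-oracle wrapper: try successively larger finite domains and
    give up once the time budget is spent.  search_domain(clauses, n) is
    assumed to return a model over a domain of n elements or None.  A
    None result from this function means 'unknown', never 'unsatisfiable'.
    (A real implementation would also bound the solver's own search,
    e.g. by node count, rather than only checking time between calls.)"""
    deadline = time.monotonic() + time_budget
    for n in range(1, max_domain + 1):
        if time.monotonic() > deadline:
            break
        model = search_domain(clauses, n)
        if model is not None:
            return model
    return None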


A second issue is the number of NMCSs maintained. Each NMCS incurs a computational cost: it involves several searches for models, and all kept clauses must at least be tested for membership of it. The more NMCSs maintained, the more comprehensive the semantic coverage of the clause space, but again at the cost of time. It is therefore important to consider strategies for deciding when (and when not) to initiate the search for a new NMCS and when to delete an ineffective NMCS from the list of those maintained.

Finally, in order to make a consistent set near-maximal, it is necessary to search for models of some of its supersets. This is very costly, and at some point ceases to be productive. Even without being able to calculate the exact probability of success, we can see the expected return from repeated model searches falling off sharply, and when this happens it is reasonable to fix the model as the best found to that point, to stop searching for improvements, and to treat the set of clauses true in the fixed model as sufficiently near-maximal to count as the NMCS for guidance purposes. This leads to yet another tradeoff, between the number of false negatives due to failure to improve the model and the cost of repeatedly searching for improvements.
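One simple way to realise the "stop improving" decision is a cut-off on consecutive failed model searches, sketched below; the cut-off value and the try_to_extend interface are our illustrative assumptions, not Scott's.

def grow_until_stale(model, candidates, try_to_extend, max_failures=3):
    """Enlarge the set of clauses satisfied by model, one candidate at a
    time, and fix the model after a run of consecutive failures.
    try_to_extend(model, clause) is assumed to return an improved model
    or None.  On return, the clauses true in the fixed model are treated
    as the NMCS for guidance purposes."""
    failures = 0
    for c in candidates:
        improved = try_to_extend(model, c)
        if improved is None:
            failures += 1
            if failures >= max_failures:
                break   # expected return has fallen off; stop searching
        else:
            model, failures = improved, 0
    return model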

3 Implementation

Scott (Semantically Constrained OTTer) implements the above guidance strategy. It is based on the theorem prover Otter [2], which employs the given clause algorithm with a range of inference rules including several variants of resolution and hyper-resolution, paramodulation and the like. To generate models as witnesses of NMCSs, it uses the finite domain constraint solver Finder [6], which takes as input a set of first order clauses and seeks to model it in as small a domain as possible.

In principle, generating an NMCS is straightforward: simply scan the list of clauses, adding a clause whenever the resultant set can be shown consistent. Consistency is established by finding a model. By listing the clauses in different orders, different NMCSs may be obtained. In Scott this process is dynamic, since the NMCSs change as more clauses are deduced, and so the witnessing models may also be changed during the search. It has always been part of the concept of Scott that semantic guidance should be fitted to the particular proof search, being driven by the actual clauses deduced rather than being set up in advance.

Figure 2 shows how Scott builds up its collection of NMCSs and their witnessing models as a by-product of labelling clauses. The algorithm shown in the figure is simplified for clarity: as noted in the last section, there are times when no attempt is made to generate a new theory Tn+1, and when clauses not modelled by a witness are simply labelled as not belonging to the corresponding theory without any attempt to extend the model to accept them. In fact, we have found that it pays to keep tight control over launching searches for new or better models. Most details of how that control is exercised are beyond the scope of an exposition such as this. However, some points of departure from the "pure" algorithm are worth noting:

1. It sometimes happens, as the proof search develops, that one of the theories Ti restricted to the set of support becomes a subset of another Tj. In that case, Ti should be deleted.
2. Sometimes a theory includes the whole set of support. In that case, it makes no relevant distinction and so again should be deleted. Note that discarded theories may become useful again later, as more clauses are deduced, and have to be rediscovered.
3. In order to reduce the number of NMCSs generated, and to help avoid some pathological cases such as modelling the whole set of support, certain clauses are distinguished as a semantic base and required to be in all NMCSs used. In some circumstances Scott may decide to change the semantic base, if the coverage of the set of support is becoming poor.
4. The guidance given by each NMCS is improved by re-testing those clauses marked as excluded from it whenever the witnessing model is updated. This re-testing is cheap compared with the cost of generating the new model, and catches a significant number of false negatives.


GIVEN: a list of theories (sets of clauses) T1, ..., Tn witnessed by models w1, ..., wn respectively.
GIVEN: a procedure 'oracle' taking a clause set as parameter and returning either a model of that set or FAIL.

PROCEDURE test(c: clause, S: clause set, w: model): boolean
    local x: model
    If w ⊭ c then
        1. x ← oracle(S ∪ {c})
        2. If x = FAIL then return FALSE
        3. w ← x
    S ← S ∪ {c}
    return TRUE

With each new kept clause c do
    1. For each theory ⟨Ti, wi⟩ do test(c, Ti, wi)
    2. If for all i, wi ⊭ c then
        (a) Let Tn+1 be {c} and wn+1 be oracle({c})
        (b) For each clause d already generated do test(d, Tn+1, wn+1)

Fig. 2. The simple Scott algorithm
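For concreteness, here is a minimal Python rendering of the simplified algorithm of Figure 2. The model interface (a satisfies method) and the oracle function are assumed stand-ins for Finder and the models it returns; this is a sketch under those assumptions, not Scott's source.

def test(c, theory, witness, oracle):
    """Figure 2's test: try to keep clause c inside the theory.

    theory is a set of clauses and witness a one-element list holding its
    current model (a list so that an update is visible to the caller).
    oracle returns a model of a clause set, or None for FAIL."""
    if not witness[0].satisfies(c):
        x = oracle(theory | {c})
        if x is None:
            return False       # c is (provisionally) excluded from this theory
        witness[0] = x         # w <- x
    theory.add(c)              # S <- S ∪ {c}
    return True

def label_kept_clause(c, theories, witnesses, generated, oracle):
    """Figure 2's main step, run for each newly kept clause c."""
    kept_somewhere = False
    for theory, witness in zip(theories, witnesses):
        kept_somewhere |= test(c, theory, witness, oracle)
    if not kept_somewhere:
        # c falls outside every known theory: start a new theory around it
        # (Figure 2 assumes oracle({c}) succeeds) and re-label old clauses.
        w = oracle({c})
        if w is None:
            return
        theories.append({c})
        witnesses.append([w])
        for d in generated:
            test(d, theories[-1], witnesses[-1], oracle)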

The performance of earlier versions of Scott [8] showed how prohibitive the cost of maintaining NMCSs could be. The current version benefits from refinements such as those listed above, which have resulted from intensive experimental work over a period of years.

4 Performance

Systematic evaluation of the performance of a system which is still under development is problematic. This is especially the case with a prover as complicated as Scott, whose performance depends on the behaviour of the component systems Otter and Finder, their parameter settings, and their coordination. Development of an adequate autonomous mode for Scott is very much work in progress. Hence, at the time of writing, we do not have meaningful statistical results (for example, from running the system across the whole TPTP library). However, the following evidence is encouraging:

1. Scott spends the great majority of its time maintaining semantic information about its clause space. Because of this overhead, and the consequent low inference rate, Scott requires significant time to prove even simple problems, but its performance on harder problems does not degrade as sharply as that of most comparable provers. We illustrate with the following point.
2. For development purposes, we have used a set of 90 problems from TPTP, hand-picked to be challenging for Scott (listed in Appendix B). Currently, with a time limit of 5 minutes Scott solves 22 of these, and in 10 minutes it solves 34. Of course, that it proves 34 theorems is of no interest in itself: the point is that in 10 minutes it proves over 50% more than it does in 5.
3. The system participated in CASC-16. For the full results see http://www.cs.jcu.edu.au/~tptp/CASC-16/.


As anticipated, its performance in the MIX category showed the difficulty of devising an autonomous mode, though it did outperform its component system Otter. In the UEQ division, where the autonomous mode settings have less impact, it performed much more impressively. UEQ results for the top 7 provers are given in Appendix A. It is worth noting that the format of CASC imposes a 5-minute time limit for each problem. Scott therefore spent only a few seconds actually making inferences, so it is surprising that its performance stood comparison with the best (highly tuned) provers.

4. In an interesting technical report [9] Voronkov describes an independent investigation into the performance of the CASC-16 systems. The systems were re-run over the competition problems, but with certain trivial transformations applied, such as adding dummy arguments to functions. All of the systems have been trained to some extent on the TPTP problems; the transformations are intended to eliminate any benefit that a system may gain from such overfitting. Some of the most successful theorem provers performed spectacularly worse on the transformed problems. Scott, however, appeared to be stable, even doing better on the transformed UEQ problems than it did in the competition.

References

1. E. Lusk, J. Slaney and W. McCune. Scott: Semantically constrained Otter (system description). In Proceedings of the 12th International Conference on Automated Deduction, pages 764-768, 1994.
2. W. McCune. OTTER 3.0 Reference Manual and Guide, 1994.
3. D. Plaisted and Y. Zhu. Ordered semantic hyper linking. In Proceedings of the 14th National Conference on Artificial Intelligence (AAAI-97), 1997.
4. J. R. Slagle. Automatic theorem proving with renamable and semantic resolution. Journal of the ACM, 14(4):687-697, 1967.
5. J. Slaney. Scott: A model-guided theorem prover. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 109-114, 1993.
6. J. Slaney. Finder: Finite domain enumerator (system description). In Proceedings of the 12th International Conference on Automated Deduction, pages 764-768, 1994.
7. J. Slaney and T. Surendonk. Combining finite model generation with theorem proving. In Frontiers of Combining Systems, pages 141-155, 1996.
8. G. Sutcliffe and C. B. Suttner. The CADE-15 ATP system competition. Journal of Automated Reasoning, 1998.
9. A. Voronkov. CASC-16 1/2. Technical report, Department of Computer Science, University of Manchester, UK, 2000.
10. L. Wos. Automated Reasoning: 33 Basic Research Problems. Prentice Hall, 1988.


A CASC-16 Results for UEQ Problems

Problem    Waldmeister 799  Waldmeister 798  SCOTT    E 0.5    SPASS    Fiesta   Otter
COL003-1   1.40             timeout          90.20    31.30    timeout  timeout  noproof
COL006-5   0.00             2.10             104.50   296.90   timeout  timeout  timeout
COL006-6   0.00             0.00             timeout  timeout  timeout  timeout  timeout
COL006-7   0.00             0.50             timeout  timeout  timeout  timeout  timeout
COL042-1   16.60            timeout          timeout  2.90     timeout  timeout  timeout
COL044-6   0.00             0.00             11.10    timeout  timeout  timeout  timeout
COL044-7   0.00             0.00             103.40   timeout  timeout  timeout  timeout
COL044-8   0.00             0.00             115.20   timeout  timeout  timeout  timeout
COL044-9   0.00             0.00             102.90   timeout  timeout  timeout  timeout
COL066-1   1.30             1.20             timeout  155.40   timeout  timeout  timeout
GRP024-5   15.00            54.50            timeout  timeout  timeout  timeout  timeout
GRP164-1   236.00           timeout          noproof  timeout  timeout  timeout  timeout
GRP164-2   236.00           timeout          timeout  timeout  timeout  timeout  timeout
GRP185-1   0.60             0.60             noproof  timeout  0.20     timeout  timeout
GRP185-3   0.10             0.10             104.70   timeout  0.90     0.00     timeout
LAT022-1   11.60            106.60           185.90   246.30   timeout  175.00   timeout
LAT023-1   11.90            48.70            150.50   86.30    timeout  timeout  timeout
LCL109-2   2.00             2.10             timeout  283.90   timeout  91.30    timeout
RNG019-6   0.00             0.00             2.90     timeout  0.60     63.60    0.90
RNG021-6   0.00             0.00             53.20    timeout  0.70     timeout  2.10
RNG025-4   1.20             timeout          98.80    timeout  1.60     timeout  32.20
RNG025-6   0.10             3.10             timeout  42.30    1.10     timeout  timeout
RNG026-6   0.00             9.70             timeout  timeout  timeout  timeout  timeout
RNG027-8   22.00            timeout          timeout  timeout  timeout  timeout  timeout
RNG028-5   25.60            timeout          noproof  timeout  timeout  timeout  timeout
RNG028-8   25.50            timeout          timeout  timeout  timeout  timeout  timeout
RNG029-5   17.90            timeout          timeout  timeout  timeout  timeout  timeout
RNG029-6   18.00            timeout          timeout  timeout  timeout  timeout  timeout
RNG035-7   36.10            timeout          timeout  timeout  timeout  timeout  timeout
ROB005-1   2.50             3.00             114.50   163.90   219.40   60.00    143.40
Attempted  30               30               30       30       30       30       30
Solved     30               19               13       10       7        5        4
Av. Time   22.71            12.22            95.22    159.78   32.07    77.98    44.65

Times are in seconds; the two Waldmeister columns correspond to the entries labelled 799 and 798 in the competition results.

“Timeout” means that the prover was still searching at the end of the 5 minutes allowed. “No proof” means it terminated abnormally; in the case of Scott and Otter this was usually when the set of support had become empty because the weight limit had been set too aggressively. The above results are reproduced from the Web page of CASC-16. In all, eleven theorem provers competed in the UEQ section. For reasons of space, results are given above only for the top seven. The others were E-SETHEO, Gandalf, Vampire and Bliksem.


B Design suite

The following problems from TPTP were used in the development of Scott.

BOO003-1  BOO004-1  BOO005-1  BOO006-1  BOO009-1  BOO010-1
BOO015-1  BOO017-1  CAT003-3  CAT004-3  CAT009-4  COL036-1
COL080-2  COL081-2  FLD005-3  FLD006-1  FLD009-3  FLD013-3
FLD013-5  FLD025-3  FLD025-5  FLD043-5  FLD050-4  FLD068-3
FLD070-1  FLD071-2  GEO002-4  GEO004-1  GEO008-3  GEO009-3
GEO010-1  GEO011-1  GEO040-2  GEO059-2  GRP002-1  GRP025-1
GRP075-1  GRP077-1  GRP097-1  GRP098-1  GRP100-1  GRP102-1
GRP107-1  GRP108-1  GRP181-3  GRP181-4  GRP184-4  HEN006-1
HEN010-1  HEN010-3  LAT005-4  LCL005-1  LCL024-1  LCL040-1
LCL071-1  LCL090-1  LCL114-1  LCL116-1  LCL160-1  LCL191-1
LCL194-1  LCL195-1  LDA004-1  LDA006-2  LDA009-2  LDA012-1
RNG001-2  RNG008-2  ROB015-2  SET013-2  SET014-2  SET017-7
SET027-3  SET067-6  SET068-6  SET071-6  SET072-7  SET095-6
SET096-6  SET103-7  SET125-6  SET167-6  SET168-6  SET183-6
SET187-6  SET238-6  SET451-6  SET559-6  SET561-6  SET562-6

This article was typeset using the LaTeX macro package with the LLNCS2E class.
