25 July 2003. A revised version of the paper published at the Sixth International Conference on Theory and Applications of Satisfiability Testing, S. Margherita Ligure - Portofino (Italy), May 5-8, 2003.

SATbed: A Configurable Environment for Reliable Performance Experiments with SAT Instance Classes and Algorithms

Franc Brglez, Matthias F. Stallmann, and Xiao Yu Li
Dept. of Computer Science, NC State Univ., Raleigh, NC 27695, USA
{brglez,mfms,xyli}@unity.ncsu.edu

Abstract. Analysis of our recent experiments for a group of SAT solvers and several classes of problem instances suggests a common mathematical framework with experiments in component reliability. In the latter, we observe the distribution of component lifetime; in the former, we observe the distribution of execution time (runtime). A lifetime distribution for hardware components is frequently found to have an exponential, Weibull, Pareto, or gamma distribution. Our experiments with state-of-the-art SAT solvers reveal normal distributions and exponential distributions of runtime and other related random variables, as well as other distributions commonly observed in reliability applications. SATbed, an experimental testbed for SAT solvers, emulates the reliability framework: equivalence classes of isomorphic problem instances play the role of replicated hardware components; tests with specific SAT solvers correspond to specifically controlled environments; observations of runtime, implications, etc. correspond to observations of lifetime; and statistical analysis and modeling are based on samples of these random variables. The testbed not only facilitates systematic study and reliable improvement of any SAT solver but also supports the introduction and validation of new problem instance classes.

1 Introduction

The traditional benchmarking reports with SAT solvers, including the ones posted on the Web [1], are based on relatively few experiments. Given p problem instances and s solvers, one performs a total of p × s experiments, typically recording a runtime cost function. Experiments are performed on single unrelated instances of benchmarks from the DIMACS set [2], the SATPLAN set [3], the random 3-SAT set [4], and similar. As a measure of solver performance, various statistics are reported, such as sample mean, standard deviation, median, etc. A statistician would argue that experiments as described above should ideally be performed on the p problem instances that are of identical or near-identical ‘hardness’ rather than with unrelated instances whose hardness is not controlled. In the latter case, one cannot tell whether the observed performance variability is induced by the SAT solver or by the lack of control of the ‘hardness’ of problem instances. Increasing the number of such instances does not improve the reliability of statistical reports since there can be significant variability in reporting the

solver performance for each single instance, as illustrated in [5, 6] and this paper. More elaborate experiments, discussed next, reveal the same uncertainty about the reliability of experimental observations when one uses 'randomly generated instances': there is currently no control over the 'hardness' of such instances.

The approach introduced in [5, 6] and supported by SATbed significantly increases the number of experiments when compared to the traditional approach. We consider each instance in the traditional approach as a reference instance for which we create an equivalence class of k instances; the total number of experiments is thus (p × s) × (1 + k). To obtain an acceptable level of statistical significance, we perform experiments with k ≥ 32, which represents a substantial increase in the total number of experiments and data that must be managed and archived. By increasing the number of experiments on well-defined class instances, we argue that the proposed methodology provides insights about the average behavior of the solvers that we cannot possibly gain from traditional experiments. For an example, see the companion paper where the SATbed methodology has been applied to analyze and reliably improve the performance of a new SAT solver [7, 8].

Beyond improving the reliability of solver comparisons, our approach also enables fair comparisons between SAT solvers in heretofore incomparable categories. Stochastic search solvers, such as walksat [9] and unitwalk [10], are usually compared by doing statistical analysis of multiple runs with the same inputs, using different starting seeds (see e.g. [11, 12, 10]). Deterministic solvers, such as chaff [13] and sato [14], on the other hand, are compared either on the basis of a single run or multiple runs with incomparable inputs. Our experimental methodology puts these different categories of solvers on a level playing field.
By introducing syntactical transformations of a problem instance we can now generate, for each reference cnf formula, an equivalence class of as many instances as we find necessary for statistical significance of each experiment with either a deterministic or a stochastic solver [5, 6]. For the stochastic solver, results are (statistically) the same whether we do multiple runs with different seeds on identical inputs or with the same seed on a class of inputs that differ only via syntactical transformations. In the latter case, the same class of inputs can be used to induce a distribution of outcomes for a deterministic solver. Our ability to compare stochastic and deterministic solvers leads to a surprising conclusion in our companion paper [8].

The paper is organized as follows.¹ Section 2 reviews the solvability function [5, 6], which relates closely to the well-known survival function in reliability, followed by some representative experiments. Section 3 outlines SATbed components and a typical user-defined configuration that can automate the entire experimental run or selected segments of it. Section 4 introduces a method to generate a family of scheduling problems of satisfiable and unsatisfiable class instances, suitable for asymptotic analysis of solver performance. Section 5 presents a web-based organization of current benchmark classes and detailed reports, including statistical analysis, of several SAT solvers on these classes. Section 6 concludes the paper.

¹ The initial design of SATbed has been outlined in [5]. A more comprehensive description followed as the first revised version of [6]. A second revision led to two related but self-contained papers: [6] and this one.

2 Background

Basic experiments in component reliability involve a batch of N components, all replicas of a specific reference component, e.g., air-conditioning units, transistors, light bulbs, etc. of a specific rating from the same production line. The entire batch is placed into a controlled operating environment at the same time and we record the times at which each component fails. Clearly, the component lifetime is a random variable. Records from [15] report the duration of time between successive failures of the air-conditioning systems of each member of a fleet of 13 Boeing 720 jet airplanes. Overall, 213 observations have been recorded, totaling 19,869 hours of service. The sample average T0, or mean-time-to-failure, is thus 93.3 hours. A close fit is demonstrated between the observed lifetime of the A/C units and the exponential distribution with the parameter T0.

Experiments with a SAT solver A on N instances from a well-defined equivalence class of cnf formulas [5, 6] show that parameters such as implications and runtime are random variables X with a cumulative distribution F_X^A(x). In [5, 6], we define the estimate of F_X^A(x) as the solvability function S^A(x):

    S^A(x) = (1/N) × (number of observations that are ≤ x)    (1)
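Equation (1) is simply an empirical cumulative distribution function over the N observed costs. A minimal sketch follows; the function name and the runtime sample are ours, invented for illustration.

```python
# Minimal sketch of the solvability estimate in Eq. (1): the fraction
# of the N observed costs (e.g. runtimes) that are <= x.  The runtime
# sample below is invented for illustration.

def solvability(observations, x):
    n = len(observations)
    return sum(1 for t in observations if t <= x) / n

runtimes = [0.6, 1.2, 1.9, 2.3, 4.7, 9.1, 12.0, 30.5]
print(solvability(runtimes, 2.3))        # 0.5: half the runs finish by 2.3 s
print(1 - solvability(runtimes, 2.3))    # the reliability/survival function R(x)
```

Evaluating the function over a grid of x values traces out the solvability curves plotted in Figure 1.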

The solvability function in (1) is the complement of the reliability function (also known as the survival function), denoted as R^A(x) = 1 − S^A(x), where x may represent the lifetime of any component in a batch of N components undergoing a test in an environment A [16].

We illustrate how the solvability function distinctly characterizes a number of SAT solvers by summarizing three case studies in Figure 1. Four solvers were applied to a large number of instances from several classes: chaff [13], sato and satoL [14], and unitwalk [10]. The 100 instances from the 'random class' uf250-1065 [4] can be very different, as already demonstrated in [5, 6]. However, the 128 instances from the PC-class uf250-1065_087_PC are generated by replicating a single reference instance #87 from uf250-1065 using simple rewriting rules articulated in [5, 6]. Briefly, we generate a PC-class by randomly renaming and complementing variables and then randomly permuting literals and clauses of the reference instance. The rewriting for a PC-class changes neither the 'hardness' of the instances nor the syntactic structure: it simply permutes and complements the set of all satisfying solutions of the reference instance. A solver that has been 'trained' to find a solution for the reference formula may have to be 're-trained' for each new instance in the PC-class.

No equivalence-class instances are needed to induce variability in a stochastic solver such as unitwalk; variability can be induced by reading the reference instance only and randomly changing the seed for each repeated run. For convenience, call such a set of repeated runs an II-class (for identical instance). The two solvability functions generated for an II-class and a PC-class are equivalent, as is demonstrated graphically in Figure 1(a) and also analytically by evaluating the t-statistics below.²
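The PC rewriting just described can be sketched as follows. This is our illustration of the rules from [5, 6], not SATbed's actual generator; the function name is ours, and clauses are represented as lists of nonzero DIMACS-style integers.

```python
import random

# Sketch of PC-class rewriting: rename variables by a random permutation,
# randomly complement variables, then shuffle the literals within each
# clause and the clause order.  Satisfiability is preserved: the set of
# satisfying assignments is merely permuted and complemented.

def pc_instance(clauses, num_vars, seed):
    rng = random.Random(seed)
    perm = list(range(1, num_vars + 1))
    rng.shuffle(perm)                                      # variable renaming
    flip = [rng.choice((1, -1)) for _ in range(num_vars)]  # complementation
    out = []
    for clause in clauses:
        lits = [(1 if lit > 0 else -1) * flip[abs(lit) - 1] * perm[abs(lit) - 1]
                for lit in clause]
        rng.shuffle(lits)                                  # permute literals
        out.append(lits)
    rng.shuffle(out)                                       # permute clauses
    return out
```

Running the generator k times with different seeds on one reference formula yields a PC-class of k syntactically distinct but equally hard instances.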

² We have to decide whether the population means are the same for two distributions, given the sample means (m1, m2) and standard deviations (s1, s2) for n1 = n2 =


[Figure 1: three panels of solvability plots; x-axis: runtime (seconds), log scale in (b) and (c); y-axis: solvability (%).]

(a) 128 sat instances from the isomorphism class of uf250-1065_087_II AND from the isomorphism class of uf250-1065_087_PC; curves: UnitWalk-uf250-1065_087_II and UnitWalk-uf250-1065_087_PC.

    II-/PC-classes of uf250-1065_087, size=128
    solverID   min    med    mean   std    max
    uw/II      0.06   7.50   10.9   9.76   51.9
    uw/PC      0.01   9.14   12.3   12.5   70.6
    (uw = unitwalk)

(b) 128 sat instances from the isomorphism class of uf250-1065_087_PC; curves: unitwalk (exponential d.), satoL (near-exponential d.), sato (heavy-tail d.), chaff (near-normal d.).

    PC-class of uf250-1065_087, size=128
    solverID   min    med    mean   std    max
    unitwalk   0.01   9.14   12.3   12.5   70.5
    satoL      0.59   27.2   30.9   19.5   66.0
    sato       0.15   59.1   148    213    1100
    chaff      103    566    614    311    1431

(c) 100 sat instances from the random class of uf250-1065_R; curves: unitwalk (heavy-tail d.), satoL (exponential d.), sato (heavy-tail d.), chaff (heavy-tail d.).

    Original random class of uf250-1065, size=100
    solverID   min    med    mean   std    max
    satoL      0.12   12.3   17.1   16.0   65.5
    unitwalk   0.00   4.94   21.6   43.2   237
    sato       0.07   43.4   105    171    1242
    chaff      0.10   17.7   116    244    1320
Case studies above illustrate solvability functions (induced on SAT solvers by problem instances in the II-, PC-, and strictly random class of uf250-1065 [4]). The instance uf250-1065_087 is the reference instance for the II- and the PC-class [5]. Distributions labeled as exponential strictly satisfy the χ²-test at the 5% level of significance; other distributions are approximations (as outlined in the revised version [6]). (a) The same algorithm (unitwalk) is applied to 128 instances from the II-class and 128 instances from the PC-class. Instances in both classes are replicas of the same reference instance generated under different rules; the solver variability is induced by random selection of these instances. In both cases, the instances induce an exponential distribution in the algorithm's runtime. A t-test at the 5% level of significance shows that the population means of the two distributions are the same, which implies that instances from the II-class and PC-class are equally 'hard' (as expected by their construction). (b) The four distributions and sample means induced on the four solvers by 128 PC-class instances of uf250-1065_087 are strikingly distinct. (c) Here, we apply four solvers to 100 instances from the original random class. Unlike for the PC-class above, the resulting distributions (and corresponding statistics) cannot be reliably attributed to the performance of the SAT solvers alone. Rather, the distributions reflect variability in 'hardness' of the random test instances themselves [5].

Fig. 1. Case studies of class instances and solvability functions.

128 samples in Figure 1(a). Evaluating the t-statistic from the formula t = (m1 − m2)/(σ √(1/n1 + 1/n2)), where σ = √(((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2)), we find t = 1.169. Since |t| < 1.98 we have to accept, at the 5% level of significance, that the difference between the means of the two populations is not significant, and hence the II- and the PC-class are equivalent.
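The footnote's computation can be reproduced directly. The sketch below (function name ours) uses the rounded sample statistics from the table in Figure 1(a), so the resulting t differs slightly from the 1.169 obtained from unrounded data, but the conclusion is the same.

```python
from math import sqrt

# Pooled two-sample t-statistic, as in the footnote above.
def pooled_t(m1, s1, n1, m2, s2, n2):
    sigma = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / (sigma * sqrt(1.0 / n1 + 1.0 / n2))

# unitwalk II- vs PC-class samples: means 10.9 vs 12.3, std devs 9.76 vs 12.5
t = pooled_t(10.9, 9.76, 128, 12.3, 12.5, 128)
print(abs(t) < 1.98)   # True: population means not significantly different
```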

The solvability functions induced by the PC-class uf250-1065_087_PC on the four solvers are shown in Figure 1(b). The variability from 0.1 seconds to over 1000 seconds is not confined to 3-sat instances only: we report similar variability ranges for four solvers across many equivalence classes in [5, 6]. Solvability functions induced by these classes on the four solvers exhibit cases of exponential, near-exponential, normal, near-normal, heavy-tail, and incomplete distributions. It should be noted that the solvability function reported for unitwalk fits the exponential distribution as closely as the one reported for the air-conditioning units [15].

In contrast to the considerable spread of the four solvability functions induced by the PC-class uf250-1065_087_PC (where the 'hardness' of each instance is the same), consider the 'clustering' and the prevalence of 'heavy-tail' distributions in Figure 1(c). Here, the 'random class' instances induce heavy-tail distributions on three solvers, which makes comparisons of statistics between these solvers unreliable. It is far from clear how much of the runtime variability has been induced by the 'hardness' variability of these instances and how much by the intrinsic variability of the solvers. A wide range in 'hardness' variability of these instances has already been demonstrated in [5, 6].

The development of improved SAT solvers relies on the presence of challenging benchmark instances, the reference instances in our set-up. A reliable and systematic study of different reference formulas requires the same set-up as the study of SAT solver performance. Neither can be significantly improved without the other. Both require a flexible experimental testbed such as the environment we describe next.

3 SATbed Architecture and Components

The architecture of SATbed supports an environment that is open both to the SATbed maintainer and the SATbed user. The maintainer can readily install the basic environment with at least one standardized component in each category:

A reference instance generator (REFGEN) takes one or more command-line arguments (to designate size, difficulty, etc.) and produces one or more output files in .cnf format with names generated internally by the program.

An instance class generator (CLASSGEN) takes a random seed and a .cnf file and produces a single instance of a class. This instance is derived by applying appropriate transformations, e.g., random variable permutation/renaming for a P-class. The output, also in .cnf format, is self-documenting: comments give the name of the reference, the starting and ending random seeds, and the exact set of transformations used. The SATbed software provides the iteration that produces multiple instances (the configuration file can specify the number of instances as an option; the default is 32).

A SAT solver (SOLVER) takes a single instance in .cnf format and appends raw output data to a file called solverID.raw in a directory that uniquely identifies the set of input files from which the instance comes (either a class or a collection of references created by a reference generator). If the instance is satisfiable, the output includes a satisfying assignment. In any case, a solver will report various statistics such as execution time, number of backtracks, implications, etc.

This figure shows the flow of SATbed and a sample configuration script. The script can be used to specify where to start and end processing: to avoid time-consuming re-computation, for example. Naming conventions for files/directories are simple: sched07u_PC, for example, is the name of the directory storing instances of the PC-class for sched07u. The user must specify the command-line syntax of a reference generator (the line beginning sched =); all other programs are assumed to take two file names (usually input and output) as command-line arguments. A class generator uses the first line of the input file for the random seed; the reference as a .cnf file follows; the output is a single class instance with the reference, the starting and ending seeds, and the transformations used recorded as comments. A solver reads a .cnf input file and outputs its own raw format; most solvers are executed from encapsulation scripts that rearrange the command-line arguments to conform to SATbed specifications. Each solver must also supply a program called a post-processor that reads the raw output to extract data in SATbed-specific tabular format.

[Figure 2, top: SATbed flow diagram. START → REFGEN (parameters, e.g. sched) → reference instance(s) → CLASSGEN (class generators make_p, make_c, make_pc; seed) → SOLVER (e.g. sato, chaff, UnitWalk) → raw data → POST_PROCESS (tabular data, runtime statistics) → STATISTICS (additional statistics) → VIEWER (viewer options) → data formatted for documents or viewing → STOP.]

##: demo-1.cfg - a demo configuration file for the @SATbed script
START = REFGEN
STOP = POST_PROCESS
# directories for inputs and results
BENCHMARK_DIR = /mnt/fileserver/SAT-EXP/benchmarks
RESULTS_DIR = /mnt/fileserver/SAT-EXP/results
# program id's and programs for each stage
REFGEN = sched_medium_s
sched_medium_s = sched_classic.tcl 3 7 yes  # sizes 3-7, satisfiable
CLASSGEN = PC
PC = rotate_formula
PC_CLASS_SIZE = 10       # default = 32
SEED = 33,1546,1968      # random seed (3 short ints, IEEE 48)
SOLVER = chaff,unitwalk
chaff = chaff_encap.tcl
unitwalk = unitwalk_encap.tcl
TIME_OUT = 1800          # stop after this many seconds
SOLVE_CLASSES = sched_medium_s
# Each program ID in the POST_PROCESS stage must have a name
# of the form solver_pp, where solver is one of the solvers
POST_PROCESS = chaff_pp,unitwalk_pp
chaff_pp = chaff_postProcess.tcl
unitwalk_pp = unitwalk_postProcess.tcl

Fig. 2. SATbed flow and an example of an experiment-specific configuration file.
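The key = value format above is straightforward to consume. A hypothetical parser sketch follows (the actual @SATbed driver is a script whose internals we do not reproduce here; comment handling and list splitting are our assumptions).

```python
# Hypothetical sketch of reading the key = value configuration format of
# Fig. 2; not the actual @SATbed parser.

def parse_config(text):
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        # comma-separated values (e.g. SOLVER = chaff,unitwalk) become lists
        config[key] = value.split(",") if "," in value else value
    return config

cfg = parse_config("START = REFGEN\nSOLVER = chaff,unitwalk\nTIME_OUT = 1800  # seconds\n")
print(cfg["SOLVER"])    # ['chaff', 'unitwalk']
```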


A SAT solver post-processor (SOLVER_PP) currently contains a solver-specific component that turns the 'raw' solver output into a standardized SATbed tabular format, solverID.tab. Fields such as instanceName, resolutionStatus, verificationStatus, timeout, runtime, and solutionString are required; optional fields such as flips, implications, and backtracks are solver-specific. Each instance is resolved as sat, unsat, or, due to timeout, unsolved. Currently, SATbed supports post-processors for the following SAT solvers: chaff, dp0, OpenSAT, QingTing, satire, sato, satoL, unitwalk, and walksat. A generic verifier program extracts the reported solution (the last column in the tabular format) and, upon instance verification, finalizes the file solverID.tab: verificationStatus is reported as verSkip (if no solution is reported), verFail (if the solution does not verify), or verPass (if the solution verifies). This tabular file can be directly imported into any spreadsheet for analysis and plotting and is also available in HTML format for ready viewing with a web browser. Most importantly, the rows that contain unsolved and verFail in this tabular file support "data censoring"³ for SATbed utilities that generate various statistics, including distribution fitting, solvability functions, and box plots. By default, only runtime data is processed by the SATbed statistical package in this stage.

The user-invoked statistical applications (STATISTICS) stage allows the user to select any of the optional fields, such as flips, implications, or backtracks, and re-invoke the SATbed statistical package, thereby augmenting the runtime data statistics generated by default during the post-processing stage. Such additional statistics can be useful to identify platform-independent performance metrics common to a category of solvers. For example, we observed a near-perfect correlation of runtime versus implications for a number of DPLL-based solvers [5, 6].
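The censoring rule can be sketched as a filter over solverID.tab rows; field names follow the tabular format described above, while the helper name and sample rows are ours.

```python
# Sketch of "data censoring": rows flagged unsolved or verFail are
# excluded before runtime statistics are computed.

def censored_runtimes(rows):
    return [float(r["runtime"]) for r in rows
            if r["resolutionStatus"] != "unsolved"
            and r["verificationStatus"] != "verFail"]

rows = [
    {"resolutionStatus": "sat",      "verificationStatus": "verPass", "runtime": "1.2"},
    {"resolutionStatus": "unsat",    "verificationStatus": "verSkip", "runtime": "4.8"},
    {"resolutionStatus": "unsolved", "verificationStatus": "verSkip", "runtime": "1000"},
    {"resolutionStatus": "sat",      "verificationStatus": "verFail", "runtime": "0.3"},
]
print(censored_runtimes(rows))   # [1.2, 4.8]
```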
Additional statistical packages, not part of the post-processing stage, can also be invoked to analyze data columns in solverID.tab. Viewer applications (VIEWER) may range from web browsers to various experimental rendering programs or to commercial statistical packages such as JMP from SAS. Unlike the post-processing and statistics stages, such programs may access data about multiple solvers and/or multiple classes to create a single chart or table. Examples of such programs include the automated generation of LaTeX-formatted statistical reports such as presented in Table 1 and in [6], and the pairwise t-test reports discussed in [8].

Once the basic testbed is installed, the maintainer can add more reference instance generators, instance class generators, and SAT solvers. If a solver post-processor is not available, the maintainer will need to create one and add it to the testbed. As to the statistics and viewer applications, the possibilities are limitless.

A view of the SATbed architecture as a simple chain of its components is shown in Figure 2. Components targeted for execution in this chain are selected with the START and STOP options in the experiment-specific configuration file. Such self-documenting files are maintained by the user for each of the experimental

³ Data samples in a row that contains either an unsolved or a verFail entry are not included in SATbed-specific statistical reports.


designs that are to be executed. The name of this file is the only argument expected when invoking SATbed from the command line. Figure 2 shows some of the typical lines that configure the flow of the experiment before its invocation. For more details, see [17].

The architecture as we conceived it also allows substitution of back-end components by the maintainer. Taking advantage of parallel or distributed processing capabilities to execute several solvers or several class instances simultaneously is one application where this has clear advantages. We turn now to a family of benchmark instances whose refinement relies heavily on the SATbed approach.

4 Sched Class Generation

A collection of new benchmark problem instances whose difficulty we are investigating is based on a unit-length-task scheduling problem, defined as follows. Consider a set T of n tasks to be mapped to time slots 1, . . . , n on a single processor, and let S(t) denote the slot to which task t is assigned by the schedule (mapping) S. There are two constraints on the problem. One is that each task t has a deadline d(t) that imposes the condition S(t) ≤ d(t). The other is a precedence graph G = (T, P) with tasks as nodes and where a directed edge tu ∈ P means that S(t) < S(u). We refer to the problem of determining whether a feasible schedule exists as 1-PCS (1-processor precedence-constrained scheduling).

The left picture in Figure 3 shows a top-down view of the precedence graph we are currently considering, and the right picture shows the graph for sched03, one example among a whole class we call recursive N-graphs. The right picture also shows the recursive part of the definition of these graphs; for larger instances, each node of one of the smaller N-graphs could, in turn, be another N-graph. In the example shown, the tasks labeled cc would have to be scheduled, from top to bottom, in slots 1–4 to meet the deadline (4) of the bottom-left task. These would have to be followed by the three sc tasks (deadline of 7), then the cs (11) and ss (13).

To turn an arbitrary 1-PCS instance into an instance of SAT, we introduce variables s_{i,j} for 1 ≤ i, j ≤ n with the interpretation that s_{i,j} is true whenever task i is scheduled in slot j. We need only three types of clauses:

1. Each task must be scheduled by its deadline: clauses (s_{i,1} ∨ s_{i,2} ∨ · · · ∨ s_{i,d(i)}) for 1 ≤ i ≤ n;
2. At most one task can be scheduled in any slot: clauses (¬s_{i,k} ∨ ¬s_{j,k}) for 1 ≤ k ≤ n and for each pair i, j with k ≤ d(i), d(j); and
3. Precedence constraints must be observed: clauses (¬s_{t,i} ∨ s_{u,i+1} ∨ · · · ∨ s_{u,d(u)}) when tu ∈ P and for 1 ≤ i ≤ d(t).
Everything said so far assumes we are generating satisfiable instances of scheduling. An unsatisfiable instance is created if the latest deadline is decreased by 1. The scheduling instances in our current experiments are named schednns (satisfiable) or schednnu (unsatisfiable), where nn is a two-digit number specifying the size of the precedence graph. The instance sched00 (both s and u) uses

The figure on the left shows a top-level view of a recursive N-graph. Four nodes, N0, N1, N2, and N3, are joined via three directed edges, N0 N1, N0 N3, and N2 N3. The left path, N0 N1, is the critical path (C), and N2 N3 the secondary path (S). As illustrated on the right, each of the nodes can either contain another recursive N-graph (as is the case with N0, N1, and N2) or be a single node (N3). In general an edge between two nodes, such as the edge N0 N1, results in three edges between the nodes inside N0 and N1. Two of these connect the last node on the critical path inside N0 with the first nodes of both paths inside N1; the third connects the last node of the secondary path inside N0 with the last node of the secondary path inside N1. The edges N0 N3 and N2 N3 each yield only one edge because N3 only has one node, which becomes its critical path by default. For scheduling instances we identify four distinct paths through an N-graph having more than one layer (that is, one in which not all nodes are singletons) — cc (critical-critical) follows the critical path at the top level and critical paths within all nodes encountered; sc (secondary-critical) follows the secondary path at the top level and critical paths in the nodes encountered; cs and ss are analogous.

[Figure 3: left, the top-level N-graph with nodes N0, N1, N2, N3 and paths C and S; right, the recursive N-graph for sched03, with task labels cc, cs, sc, ss and deadlines 4, 7, 11, 13.]

Fig. 3. The recursive N-graph used for a satisfiable sched03 instance.

a simple 4-node, 3-edge N-graph. In general, if nn = 4k + r, where 1 ≤ r ≤ 4, then schednn uses a recursive N-graph in which nodes Ni contain the graph for size 4k when i < r and the graph of size 4(k − 1) (or a single node if k = 0) when i ≥ r.

The 1-PCS problem is not NP-complete — in fact, PCS does not become NP-complete until the number of processors is three [18, p. 239]. It is easy to generalize the second set of clauses to deal with any fixed number of processors. With three processors, for example, the clauses would take the form (¬s_{g,k} ∨ ¬s_{h,k} ∨ ¬s_{i,k} ∨ ¬s_{j,k}) for 1 ≤ k ≤ n and for all quadruples g, h, i, j with k ≤ d(g), d(h), d(i), d(j).

We anticipate further experiments with different types of precedence graphs (the recursive construction for the N-graphs can easily be generalized to any graph that has two paths), more processors (the simplest way to create multiprocessor instances is to make parallel copies of the precedence graphs, or use multiple different graphs with the same number of nodes), and/or variations in deadline setting. Such experiments are greatly facilitated with our testbed.
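For m processors, the second set of clauses generalizes to forbidding any m + 1 tasks from sharing a slot. A sketch (function name and variable mapping are ours):

```python
from itertools import combinations

# Generalized clause set 2: on m processors, no m+1 tasks may share a
# slot.  var(i, k) maps a (task, slot) pair to a DIMACS variable index.

def at_most_m_per_slot(n, m, deadline, var):
    clauses = []
    for k in range(1, n + 1):
        eligible = [i for i in range(n) if k <= deadline[i]]
        for group in combinations(eligible, m + 1):
            clauses.append([-var(i, k) for i in group])
    return clauses

var = lambda i, k: i * 3 + k
print(len(at_most_m_per_slot(3, 1, [3, 3, 3], var)))   # 9 pairwise clauses
```

With m = 1 this reduces to the pairwise clauses of the single-processor encoding; with m = 3 it produces the quadruple clauses shown above.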

In the original scheduling instances (discussed, e.g., in [5]) we made two adjustments to the deadlines before converting to SAT: (a) whenever a task t precedes a task u, we make d(t) ≤ d(u) − 1, if it isn't already, and (b) the earliest deadline of a node with no successors is increased by 1. The purpose of adjustment (a) is to eliminate unnecessary variables; adjustment (b) avoids unit clauses caused by "tight" precedence constraints. Another set of scheduling instances was recently created without either of these adjustments — unit clauses were simply eliminated by propagation and the resulting instances had fewer variables (in fact, sizes 00–02 had no clauses remaining after unit propagation).

Table 1 illustrates the dramatic difference between these two types of instances, as well as the unusual rank ordering of the solvers.⁴ Whereas instances sched07s_v01386 and sched07u_v01384 are from our original benchmark set, instances sched04z07s_v00655 and sched04z06u_v00350 are from the new set. Despite the smaller number of variables, the new instances appear to be harder to solve, so much so that we were unable to use size 07 (with 649 variables) on the unsatisfiable side: even size 06 caused timeouts in almost every case, while size 05, with 111 variables, was too trivial (runtimes were around 0.01 seconds).

The original scheduling instances provide a much more significant challenge for chaff than for the other solvers (except that walksat appears completely unable to find solutions in reasonable time for the satisfiable case). This is in contrast to chaff's superior performance in other difficult problem domains, such as the blocks world [5, 6]. The erratic behavior of sato may be due to bugs in the implementation (some of these are noted by the author [14]), while satoL, a related solver with simpler data structures,⁵ has not only the fastest, but also the most predictable, execution times.
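Unit propagation, used above to eliminate unit clauses from the new instances, can be sketched as a fixpoint loop. This is our illustration (not SATbed or solver code), assuming a conflict-free formula.

```python
# Minimal unit-propagation sketch: repeatedly assign unit clauses and
# simplify until no unit clause remains.  Conflicts (empty clauses) are
# not handled; this assumes a conflict-free formula.

def unit_propagate(clauses):
    clauses = [list(c) for c in clauses]
    assignment = {}
    changed = True
    while changed:
        changed = False
        units = [c[0] for c in clauses if len(c) == 1]
        for lit in units:
            assignment[abs(lit)] = lit > 0
            # drop satisfied clauses, remove the falsified literal elsewhere
            clauses = [[l for l in c if l != -lit]
                       for c in clauses if lit not in c]
            changed = True
    return clauses, assignment

print(unit_propagate([[1], [-1, 2], [2, 3]]))   # ([], {1: True, 2: True})
```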
The tables are turned when we move to the new scheduling instances (655 variables and 8623 clauses for the size 07 satisfiable one, 350 variables and 3394 clauses for the size 06 unsatisfiable one). Tied for best solver in the satisfiable case is walksat, which was not even a contender with the original instance. The clear loser is satoL, the fastest solver for the original instance; it times out more often than not and runs hundreds of seconds even when it doesn't. As already pointed out, the relatively small unsatisfiable new scheduling instance appears to be too challenging (within our time-out limit of 1000 seconds) for any of the solvers we tested. We will fill in more detail with longer runs (and/or faster processors) for the final version of this paper.

A key point about our work with the scheduling instances is our ability to use SATbed to quickly create and report experimental results for a whole family of related instances of different sizes. A family of scheduling instances might consist of the (PC classes of the) five instances sched04s–sched08s. These range from 140 variables and 892 clauses to 2088 variables and 47080 clauses. The smallest is suitable for detailed tracing while the largest is challenging for even the fastest solvers. Five size levels are also enough to observe important asymptotic trends; e.g., a solver that is not competitive on the given instances may have an execution time that grows significantly more slowly than that of other solvers, an observation that would make it a solver of choice given either a faster machine or a faster implementation.

⁴ Runtime is on a 350MHz PowerPC (Apple Macintosh G3) with 384MB of memory.
⁵ This is sato run with the -d1 switch.


The development of the new, harder scheduling instances went through four revisions before arriving at the current level of challenge. This process would have been nearly impossible to manage without SATbed.

5 Current Postings on the Web

A well-defined schema is required to manage large volumes of input data sets and repeated executions of several solvers, each writing results in solver-specific formats. Two components of the experimental design schema (EDS) that evolved

Table 1. SAT solver comparisons on satisfiable and unsatisfiable instances in PC classes of two sched families.

costID = timeToSolve (seconds) – sat instances

Class labels: name=sched07s_v01386, type=PC, size=32
solverID     initV  minV  meanV  maxV  max/min  stDev  distribution
satoL        0.08   0.04  0.07   0.13  3.25     0.02   near-normal
unitwalk     0.06   0.06  0.41   2.33  38.8     0.47   exponential
sato         0.11   0.09  1.28   37.1  412      6.54   heavy-tailed
chaff        16.7   1.49  11.0   36.3  24.4     7.49   exponential
walksat(1)   1000   1000  1000   1000  1.00     0.00   incomplete

Class labels: name=sched04z07s_v00655, type=PC, size=32
solverID     initV  minV  meanV  maxV  max/min  stDev  distribution
sato(2)      0.31   0.04  0.60   2.31  57.8     0.64   exponential
walksat      2.27   0.03  0.82   2.34  78.0     0.63   exponential
unitwalk     1.86   0.05  2.37   12.2  244      2.62   exponential
chaff        1.98   0.20  30.2   165   825      43.1   exponential

costID = timeToSolve (seconds) – unsat instances

Class labels: name=sched07u_v01384, type=PC, size=32
solverID     initV  minV  meanV  maxV  max/min  stDev  distribution
satoL        0.18   0.18  0.28   0.31  1.72     0.02   near-normal
sato(3)      0.18   0.18  0.24   0.28  1.56     0.02   incomplete
chaff        5.68   1.63  9.13   27.8  17.1     6.27   exponential

Class labels: name=sched04z06u_v00350, type=PC, size=8
solverID     initV  minV  meanV  maxV  max/min  stDev  distribution
chaff(4)     1000   1000  1000   1000  1.00     0.00   NA
sato(4)      1000   1000  1000   1000  1.00     0.00   NA
satoL(4)     1000   1000  1000   1000  1.00     0.00   NA

(1) All attempted instances timed out at 1000 seconds, but some runs are still pending.
(2) One instance timed out at 1000 seconds (possibly a bug in the solver); the statistics are for the remaining 32.
(3) Three instances timed out at 1000 seconds (possibly a bug in the solver); the statistics are for the remaining 30.
(4) All attempted instances timed out at 1000 seconds.


benchm_SATcnf
  + bw_large_s
  + bw_large_u
  ...
  + queen_medium
  - queen_small
    - @references
        queen04_v00016.cnf
        queen04_v00025.cnf
        queen04_v00036.cnf
        ...
    - queen04_v00016
      - queen04_v00016_C
          i0000.cnf
          i0001.cnf
          ...
      + queen04_v00016_I
      + queen04_v00016_P
      + queen04_v00016_PC
    + queen04_v00025
    + queen04_v00036
    ...
  + sched_s
  + sched_u
  ...

results_SATcnf
  + chaff
  + dp0_nat
  ...
  + satire
  - sato
    + bw_large_s
    ...
    + queen_medium
    - queen_small
      - @references
          sato.raw
          sato.html
          ...
      - queen04_v00016
        - queen04_v00016_C
            sato.raw
            sato.html
            ...
        + queen04_v00016_P
        + queen04_v00016_PC
      + queen04_v00025
      ...
    + sched_s
    ...
  + satoL
  ...

Fig. 4. Organization of benchmark classes and experimental results in SATbed.

in this work are shown in Figure 4: benchm_SATcnf archives all input data sets; results_SATcnf archives the results of every experiment generated by each solver. As discussed in Section 3, the initial results of experiments are in solver-specific (.raw) formats and are post-processed into several formats and posted on the Web. An example of such postings is shown in Figure 5: three solvers, dp0_nat, chaff, and sato, have been applied to class instances from sched03s_v0095 (introduced in this paper). Note that the distributions of reported variables range from exponential to normal and heavy-tail, as well as impulse (0 backtracks are reported by sato for all instances under test). Since dp0_nat is executed under a Tcl interpreter (and is relatively slow compared to programs written in C), the underlying solvability graph displays the number of implications required by each solver. The crossover of dp0_nat and chaff makes this benchmark class (and others in the sched family) of particular interest for further study. To access compressed archives of input data sets under benchm_SATcnf; raw results, statistical summaries, and tabulated data under results_SATcnf; and SATbed itself with complete documentation and running examples, the reader is invited to follow the links posted on the home page http://www.cbl.ncsu.edu/OpenExperiments/SAT/

6 Conclusions

In order for the equivalence-class methodology proposed in [5, 6] to gain wide acceptance, the generation of equivalence classes, the execution of multiple solvers on these classes, and the extraction of meaningful data must be simplified as much as possible. The SATbed software described here is, we believe, an important step in this direction. Once SATbed is installed, users will be able to experiment readily with new solvers, and do so with reliability and statistical significance. They will be able to compare solvers fairly in previously hard-to-compare categories, such as stochastic versus deterministic. They will also be able to develop and refine new, more difficult benchmark instances and class-generation methodologies. Finally, large experiments on families of related instances will become more manageable than ever before. Setting up, running, and analyzing data for the required experiments on a family of instances can be a daunting task, error-prone if done manually. Not only does SATbed make gathering data for a family of instance classes manageable, it also allows researchers to gain insights from variations of the instance characteristics. The process of creating and refining 'hard' instances of SAT, if done via careful experimentation on instance classes and families, yields important insights that cannot be gained from the development of better solvers, but could lead to the latter in the long run. We invite researchers to check out the tutorial about benchmark equivalence classes, the results of our experiments, and a brief guide on using SATbed [17]. This testbed will allow researchers not only to replicate any of the web-posted experiments but also to contribute new class instances and share new experimental results – which again can be replicated by others.

Acknowledgments. The experiments reported in this paper could not have taken place without SAT solvers such as chaff, satire, sato, satoL, unitwalk, and walksat. We thank the authors for the ready access to and the exceptional ease of installation of these software packages. We also thank Dr. Jason Osborne from the Department of Statistics (NCSU) for his advice on the likelihood ratio test statistic for exponential distributions. We appreciate the reassurance that in our experiments with 128 samples, the simpler t-test statistic has sufficient power to resolve the hypothesis of equal population means of two exponential distributions.

(a) dp0_nat.html file (table of samples) extracted from the dp0_nat.raw file
(b) dp0_nat_stat.html file (table of statistics) extracted from dp0_nat.html
(c) chaff_stat.html file (table of statistics) extracted from chaff.html
(d) sato_stat.html file (table of statistics) extracted from sato.html
(e) solvability functions computed on samples from the three solvers above

Fig. 5. Experimental results reported by three SAT solvers for instances from the class sched03s_v0095_PC. Tables are posted on the Web by SATbed as experiments unfold.

References

1. Laurent Simon. SAT-Ex: The experimentation web site around the satisfiability, 2003. For more information, see http://www.lri.fr/~simon/satex/satex.php3.
2. Michael A. Trick. Second DIMACS challenge test problems. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 26:653–657, 1993. The SAT benchmark sets are available at ftp://dimacs.rutgers.edu/pub/challenge/satisfiability.
3. Henry Kautz, David McAllester, and Bart Selman. Encoding plans in propositional logic. KR'96: Principles of Knowledge Representation and Reasoning, pages 374–384, 1996. The SATPLAN benchmark set is available at http://sat.inesc.pt/benchmarks/cnf/satplan/.
4. SATLIB – The Satisfiability Library, 2003. See http://www.satlib.org.
5. F. Brglez, X. Y. Li, and M. Stallmann. The Role of a Skeptic Agent in Testing and Benchmarking of SAT Algorithms. In Fifth International Symposium on the Theory and Applications of Satisfiability Testing, May 2002. Available at http://www.cbl.ncsu.edu/publications/.
6. F. Brglez, X. Y. Li, and M. Stallmann. On SAT Instance Classes and a Method for Reliable Performance Experiments with SAT Solvers. Annals of Mathematics and Artificial Intelligence (AMAI), Special Issue on Satisfiability Testing, 2003. Under review; submitted to AMAI as the revision of the paper published at the Fifth International Symposium on the Theory and Applications of Satisfiability Testing, Cincinnati, Ohio, USA, May 2002. Available at http://www.cbl.ncsu.edu/publications/.
7. X. Y. Li, M. Stallmann, and F. Brglez. QingTing: A Fast SAT Solver Using Local Search and Efficient Unit Propagation. In Proceedings of SAT 2003, Sixth International Symposium on the Theory and Applications of Satisfiability Testing, S. Margherita Ligure – Portofino, Italy, May 5–8 2003. Available from http://www.cbl.ncsu.edu/publications/.
8. X. Y. Li, M. F. Stallmann, and F. Brglez. QingTing: A Local Search SAT Solver Using an Effective Switching Strategy and an Efficient Unit Propagation. Springer-Verlag Lecture Notes in Computer Science (LNCS), Special Issue on Satisfiability Testing, 2003. Under review; submitted to LNCS as the revision of the paper published at the Sixth International Symposium on the Theory and Applications of Satisfiability Testing, S. Margherita Ligure – Portofino, Italy, May 5–8 2003. Available from http://www.cbl.ncsu.edu/publications/.
9. David A. McAllester, Bart Selman, and Henry A. Kautz. Evidence for invariants in local search. In AAAI/IAAI, pages 321–326, 1997.
10. E. Hirsch and A. Kojevnikov. UnitWalk: A new SAT solver that uses local search guided by unit clause elimination. PDMI preprint 9/2001, Steklov Institute of Mathematics at St. Petersburg, 2001.
11. Holger H. Hoos and Thomas Stützle. Evaluating Las Vegas Algorithms – Pitfalls and Remedies. In UAI-98, pages 238–245. Morgan Kaufmann Publishers, 1998.
12. Holger H. Hoos and Thomas Stützle. Local Search Algorithms for SAT: An Empirical Evaluation. Journal of Automated Reasoning, 24, 2000.
13. Matthew Moskewicz, Conor Madigan, Ying Zhao, Lintao Zhang, and Sharad Malik. Chaff: Engineering an efficient SAT solver. In IEEE/ACM Design Automation Conference (DAC), 2001. Version 1.0 of Chaff is available at http://www.ee.princeton.edu/~chaff/zchaff/zchaff.2001.2.17.src.tar.gz.
14. Hantao Zhang. SATO: An efficient propositional prover. In Conference on Automated Deduction, pages 272–275, 1997. Version 3.2 of SATO is available at ftp://cs.uiowa.edu/pub/hzhang/sato/sato.tar.gz.
15. F. Proschan. Theoretical explanation of observed decreasing failure rate. Technometrics, 5:375, 1963.
16. F. Jensen. Electronic Component Reliability: Fundamentals, Modelling, Evaluation, and Assurance. J. Wiley, 1996.
17. F. Brglez, M. Stallmann, and X. Y. Li. SATbed Home Page: A Tutorial, A User Guide, A Software Archive, Archives of SAT Instance Classes and Experimental Results, 2003. Available at http://www.cbl.ncsu.edu/OpenExperiments/SAT/.
18. M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.



1 Introduction

The traditional benchmarking reports with SAT solvers, including the ones posted on the Web [1], are based on relatively few experiments. Given p problem instances and s solvers, one performs a total of p × s experiments, typically recording a runtime cost function. Experiments are performed on single unrelated instances of benchmarks from the DIMACS set [2], the SATPLAN set [3], the random 3-SAT set [4], and similar. As a measure of solver performance, various statistics are reported, such as sample mean, standard deviation, median, etc. A statistician would argue that experiments as described above should ideally be performed on the p problem instances that are of identical or near-identical ‘hardness’ rather than with unrelated instances whose hardness is not controlled. In the latter case, one cannot tell whether the observed performance variability is induced by the SAT solver or by the lack of control of the ‘hardness’ of problem instances. Increasing the number of such instances does not improve the reliability of statistical reports since there can be significant variability in reporting the

solver performance for each single instance, as illustrated in [5, 6] and in this paper. More elaborate experiments, discussed next, reveal the same uncertainty about the reliability of experimental observations when one uses 'randomly generated instances' – there is currently no control on the 'hardness' of such instances.

The approach introduced in [5, 6] and supported by SATbed significantly increases the number of experiments when compared to the traditional approach. We consider each instance in the traditional approach as a reference instance for which we create an equivalence class of k instances; the total number of experiments is thus (p × s) × (1 + k). To obtain an acceptable level of statistical significance, we perform experiments with k ≥ 32, which represents a substantial increase in the total number of experiments and data that must be managed and archived. By increasing the number of experiments on well-defined class instances, we argue that the proposed methodology provides insights about the average behavior of the solvers that we cannot possibly gain from traditional experiments. For an example, see the companion paper, where the SATbed methodology has been applied to analyze and reliably improve the performance of a new SAT solver [7, 8].

Beyond improving the reliability of solver comparisons, our approach also enables fair comparisons between SAT solvers in heretofore incomparable categories. Stochastic search solvers, such as walksat [9] and unitwalk [10], are usually compared by doing statistical analysis of multiple runs with the same inputs, using different starting seeds (see e.g. [11, 12, 10]). Deterministic solvers, such as chaff [13] and sato [14], on the other hand, are compared either on the basis of a single run or of multiple runs with incomparable inputs. Our experimental methodology puts these different categories of solvers on a level playing field.
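The experiment count (p × s) × (1 + k) is simple arithmetic; a quick sketch, where the values of p, s, and k below are purely illustrative:

```python
# Total experiments under the equivalence-class methodology of [5, 6]:
# each of the p references contributes itself plus k class instances,
# each run by all s solvers.
def total_experiments(p: int, s: int, k: int) -> int:
    """p reference instances, s solvers, k class instances per reference."""
    return (p * s) * (1 + k)

# Illustrative (hypothetical) values: 10 references, 4 solvers, k = 32.
traditional = 10 * 4                          # the traditional p * s runs
with_classes = total_experiments(10, 4, 32)
print(traditional, with_classes)  # 40 1320
```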
By introducing syntactical transformations of a problem instance we can now generate, for each reference cnf formula, an equivalence class of as many instances as we find necessary for statistical significance of each experiment with either a deterministic or a stochastic solver [5, 6]. For a stochastic solver, results are (statistically) the same whether we do multiple runs with different seeds on identical inputs or with the same seed on a class of inputs that differ only via syntactical transformations. In the latter case, the same class of inputs can be used to induce a distribution of outcomes for a deterministic solver. Our ability to compare stochastic and deterministic solvers leads to a surprising conclusion in our companion paper [8].

The paper is organized as follows.¹ Section 2 reviews the solvability function [5, 6], which relates closely to the well-known survival function in reliability, followed by some representative experiments. Section 3 outlines SATbed components and a typical user-defined configuration that can automate the entire experimental run or selected segments of it. Section 4 introduces a method to generate a family of scheduling problems with satisfiable and unsatisfiable class instances, suitable for asymptotic analysis of solver performance. Section 5 presents the web-based organization of current benchmark classes and detailed reports, including statistical analysis, of several SAT solvers on these classes. Section 6 concludes the paper.

¹ The initial design of SATbed has been outlined in [5]. A more comprehensive description followed as the first revised version of [6]. A second revision led to two related but self-contained papers: [6] and this one.


2 Background

Basic experiments in component reliability involve a batch of N components, all replicas of a specific reference component, e.g., air-conditioning units, transistors, light bulbs, etc., of a specific rating from the same production line. The entire batch is placed into a controlled operating environment at the same time and we record the times at which each component fails. Clearly, the component lifetime is a random variable. Records from [15] report the duration of time between successive failures of the air-conditioning systems of each member of a fleet of 13 Boeing 720 jet airplanes. Overall, 213 observations have been recorded, totaling 19,869 hours of service. The sample average T0, or mean-time-to-failure, is thus 93.3 hours. A close fit is demonstrated between the observed lifetime of the A/C units and the exponential distribution with the parameter T0.

Experiments with a SAT solver A on N instances from a well-defined equivalence class of cnf formulas [5, 6] show that parameters such as implications and runtime are random variables X with a cumulative distribution F_X^A(x). In [5, 6], we define the estimate of F_X^A(x) as the solvability function S^A(x):

    S^A(x) = (1/N) × (number of observations that are ≤ x)        (1)
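Equation (1) is an empirical cumulative distribution function over the observed costs; a minimal sketch, where the runtimes below are made-up values rather than measurements from the paper:

```python
def solvability(observations, x):
    """S^A(x) from Eq. (1): fraction of observations that are <= x."""
    return sum(1 for obs in observations if obs <= x) / len(observations)

def reliability(observations, x):
    """R^A(x) = 1 - S^A(x), the survival function of the cost."""
    return 1.0 - solvability(observations, x)

# Hypothetical runtimes (seconds) for N = 8 class instances:
runtimes = [0.06, 0.41, 1.2, 2.3, 5.9, 7.5, 12.0, 51.9]
print(solvability(runtimes, 2.3))  # 0.5 -- half the runs finish by 2.3 s
```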

The solvability function in (1) is the complement of the reliability function (also known as the survival function), denoted as R^A(x) = 1 − S^A(x), where x may represent the lifetime of any component in a batch of N components undergoing a test in an environment A [16]. We illustrate how the solvability function distinctly characterizes a number of SAT solvers by summarizing three case studies in Figure 1. Four solvers were applied to a large number of instances from several classes: chaff [13], sato and satoL [14], and unitwalk [10]. The 100 instances from the 'random class' uf250-1065 [4] can be very different, as already demonstrated in [5, 6]. However, the 128 instances from the PC-class uf250-1065_087_PC are generated by replicating a single reference instance #87 from uf250-1065 using the simple rewriting rules articulated in [5, 6]. Briefly, we generate a PC-class by randomly renaming and complementing variables and then randomly permuting literals and clauses of the reference instance. The rewriting for a PC-class changes neither the 'hardness' of the instances nor the syntactic structure: it simply permutes and complements the set of all satisfying solutions of the reference instance. A solver that has been 'trained' to find a solution for the reference formula may have to be 're-trained' for each new instance in the PC-class.

No equivalence-class instances are needed to induce variability in a stochastic solver such as unitwalk; variability can be induced by reading the reference instance only and randomly changing the seed for each repeated run. For convenience, call such a set of repeated runs an II-class (for identical instance). The two solvability functions generated for an II-class and a PC-class are equivalent, as is demonstrated graphically in Figure 1(a) and also analytically by evaluating the t-statistic below.²
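The PC-class rewriting just described (random renaming and complementing of variables, then random permutation of literals and clauses) can be sketched for DIMACS-style clause lists. This is our own illustrative sketch, not SATbed's actual class generator:

```python
import random

def pc_transform(clauses, num_vars, seed):
    """Return an isomorphic instance: variables randomly renamed and
    complemented, then literals and clauses randomly permuted.
    Clauses are lists of nonzero ints (DIMACS convention)."""
    rng = random.Random(seed)
    # Random renaming: old variable v -> new variable perm[v - 1].
    perm = list(range(1, num_vars + 1))
    rng.shuffle(perm)
    # Random complementing: flip[v] == -1 flips the polarity of variable v.
    flip = {v: rng.choice((1, -1)) for v in range(1, num_vars + 1)}
    new_clauses = []
    for clause in clauses:
        lits = [flip[abs(l)] * perm[abs(l) - 1] * (1 if l > 0 else -1)
                for l in clause]
        rng.shuffle(lits)           # permute literals within the clause
        new_clauses.append(lits)
    rng.shuffle(new_clauses)        # permute the clauses
    return new_clauses

# Example: a tiny 3-variable formula.
ref = [[1, -2, 3], [-1, 2], [2, 3]]
print(pc_transform(ref, 3, seed=42))
```

Because the rewriting is a literal-level bijection, the transformed instance has exactly the same hardness as the reference, which is the point of the PC-class construction.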

² We have to decide whether the population means are the same for two distributions, given the sample means (m1, m2) and standard deviations (s1, s2) for n1 = n2 = 128 samples in Figure 1(a). Evaluating the t-statistic from the formula t = (m1 − m2)/(σ·sqrt(1/n1 + 1/n2)), where σ = sqrt(((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2)), we find t = 1.169. Since |t| < 1.98 we have to accept, at the 5% level of significance, that the difference between the means of the two populations is not significant, and hence the II- and the PC-class are equivalent.

[Figure 1 consists of three solvability plots – solvability (%) versus runtime (seconds) – of which only the panel titles, statistics tables, and caption are reproduced here.]

(a) 128 sat instances from the isomorphism class of uf250-1065_087_II and from the isomorphism class of uf250-1065_087_PC; costID = runtime (seconds). Plotted series: UnitWalk-uf250-1065_087_II and UnitWalk-uf250-1065_087_PC.

    II-/PC-classes of uf250-1065_087, size=128
    solverID   min    med    mean   std    max
    uw/II      0.06   7.50   10.9   9.76   51.9
    uw/PC      0.01   9.14   12.3   12.5   70.6
    (uw = unitwalk)

(b) 128 sat instances from the isomorphism class of uf250-1065_087_PC; costID = runtime (seconds).

    PC-class of uf250-1065_087, size=128
    solverID   min    med    mean   std    max    distribution
    unitwalk   0.01   9.14   12.3   12.5   70.5   exponential
    satoL      0.59   27.2   30.9   19.5   66.0   near-exponential
    sato       0.15   59.1   148    213    1100   heavy-tail
    chaff      103    566    614    311    1431   near-normal

(c) 100 sat instances from the random class of uf250-1065_R; costID = runtime (seconds).

    Original random class of uf250-1065, size=100
    solverID   min    med    mean   std    max    distribution
    satoL      0.12   12.3   17.1   16.0   65.5   exponential
    unitwalk   0.00   4.94   21.6   43.2   237    heavy-tail
    sato       0.07   43.4   105    171    1242   heavy-tail
    chaff      0.10   17.7   116    244    1320   heavy-tail

The case studies above illustrate solvability functions (induced on SAT solvers by problem instances in the II-, PC-, and the strictly random class uf250-1065 [4]). The instance uf250-1065_087 is the reference instance for the II- and the PC-class [5]. Distributions labeled as exponential strictly satisfy the χ²-test at the 5% level of significance; other distributions are approximations (as outlined in the revised version of [6]). (a) The same algorithm (unitwalk) is applied to 128 instances from the II-class and 128 instances from the PC-class. Instances in both classes are replicas of the same reference instance generated under different rules; the solver variability is induced by random selection of these instances. In both cases, the instances induce an exponential distribution in the algorithm's runtime. A t-test at the 5% level of significance shows that the population means of the two distributions are the same – which implies that instances from the II-class and the PC-class are equally 'hard' (as expected by their construction). (b) The four distributions and sample means induced on the four solvers by 128 PC-class instances of uf250-1065_087 are strikingly distinct. (c) Here, we apply four solvers to 100 instances from the original random class. Unlike for the PC-class above, the resulting distributions (and corresponding statistics) cannot be reliably attributed to the performance of the SAT solvers alone. Rather, the distributions reflect variability in 'hardness' of the random test instances themselves [5].

Fig. 1. Case studies of class instances and solvability functions.
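The pooled two-sample t statistic used in footnote 2 can be computed directly. With the rounded statistics tabulated in Figure 1(a) the value differs slightly from the reported t = 1.169, but either way |t| stays below the 5% critical value of 1.98:

```python
import math

def pooled_t(m1, s1, n1, m2, s2, n2):
    """Two-sample t statistic with pooled standard deviation
    (equal-variance assumption, as in footnote 2)."""
    sigma = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / (sigma * math.sqrt(1 / n1 + 1 / n2))

# Sample statistics from Figure 1(a): unitwalk on the II- and the PC-class.
t = pooled_t(10.9, 9.76, 128, 12.3, 12.5, 128)
print(abs(t) < 1.98)  # True -- the difference in means is not significant
```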

The solvability functions induced by the PC-class uf250-1065_087_PC on the four solvers are shown in Figure 1(b). The variability from 0.1 seconds to over 1000 seconds is not confined to 3-sat instances only. We report similar variability ranges for four solvers across many equivalence classes in [5, 6]. Solvability functions induced by these classes on the four solvers report cases of exponential, near-exponential, normal, near-normal, heavy-tail, and incomplete distributions. It should be noted that the solvability function reported by unitwalk fits the exponential distribution as closely as the one reported for the air-conditioning units [15].

In contrast to the considerable spread of the four solvability functions induced by the PC-class uf250-1065_087_PC (where the 'hardness' of each instance is the same), consider the 'clustering' and the prevalence of 'heavy-tail' distributions in Figure 1(c). Here, the 'random class' instances induce a heavy-tail distribution on three solvers, which makes comparisons of statistics between these solvers unreliable. It is far from clear how much of the runtime variability has been induced by the 'hardness' variability of these instances and how much by the intrinsic variability of the solvers. A wide range in 'hardness' variability of these instances has already been demonstrated in [5, 6].

The development of improved SAT solvers relies on the presence of challenging benchmark instances, the reference instances in our set-up. A reliable and systematic study of different reference formulas requires the same set-up as the study of SAT solver performance. Neither can be significantly improved without the other. Both require a flexible experimental test bed such as the environment we describe next.

3 SATbed Architecture and Components

The architecture of SATbed supports an environment that is open both to the SATbed maintainer and the SATbed user. The maintainer can readily install the basic environment with at least one standardized component in each category:

A reference instance generator (REFGEN) takes one or more command-line arguments (to designate size, difficulty, etc.) and produces one or more output files in .cnf format with names generated internally by the program.

An instance class generator (CLASSGEN) takes a random seed and a .cnf file and produces a single instance of a class. This instance is derived by applying appropriate transformations, e.g., random variable permutation/renaming for a P class. The output, also in .cnf format, is self-documenting: comments give the name of the reference, the starting and ending random seed, and the exact set of transformations used. The SATbed software provides the iteration that produces multiple instances (the configuration file can specify the number of instances as an option; the default is 32).

A SAT solver (SOLVER) takes a single instance in .cnf format and appends raw output data to a file called solverID.raw in a directory that uniquely identifies the set of input files from which the instance comes (either a class or a collection of references created by a reference generator). If the instance is satisfiable, the output includes a satisfying assignment. In any case, a solver will report various statistics such as execution time, number of backtracks, implications, etc.
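A driver loop honoring the SOLVER contract above (one .cnf instance in, raw output appended to solverID.raw in a class-specific directory) might be sketched as follows; the function name and path layout are illustrative assumptions, not SATbed's actual implementation:

```python
import subprocess
from pathlib import Path

def run_solver_on_class(solver_id, solver_cmd, class_dir, results_root,
                        timeout=1000):
    """Run one solver on every .cnf instance of a class, appending raw
    output to <results_root>/<class name>/<solver_id>.raw."""
    out_dir = Path(results_root) / Path(class_dir).name
    out_dir.mkdir(parents=True, exist_ok=True)
    raw_file = out_dir / f"{solver_id}.raw"
    with raw_file.open("a") as raw:
        for cnf in sorted(Path(class_dir).glob("*.cnf")):
            try:
                proc = subprocess.run([solver_cmd, str(cnf)],
                                      capture_output=True, text=True,
                                      timeout=timeout)
                raw.write(proc.stdout)
            except subprocess.TimeoutExpired:
                # Record the timeout as a DIMACS-style comment line.
                raw.write(f"c {cnf.name}: timeout after {timeout} seconds\n")
```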

This figure shows the flow of SATbed and a sample configuration script. The script can be used to specify where to start and end processing: to avoid time-consuming re-computation, for example. Naming conventions for files/directories are simple: sched07u_PC, for example, is the name of the directory storing instances of the PC class for sched07u. The user must specify the command-line syntax of a reference generator (the line beginning sched =); all other programs are assumed to take two file names (usually input and output) as command-line arguments. A class generator uses the first line of the input file for the random seed; the reference as a .cnf file follows; the output is a single class instance with the reference, the starting and ending seed, and the transformations used recorded as comments. A solver reads a .cnf input file and outputs its own raw format; most solvers are executed from encapsulation scripts that rearrange the command-line arguments to conform to SATbed specifications. Each solver must also supply a program called a post-processor that reads the raw output to extract data in a SATbed-specific tabular format.

START
  |
  v
REFGEN  (parameters, e.g. sched)                 -> reference instance(s)
  |
  v
CLASSGEN  (class: make_p, make_c, make_pc; seed) -> class instances
  |
  v
SOLVER  (sato, chaff, UnitWalk, ...)             -> raw data
  |
  v
POST_PROCESS                                     -> tabular data, runtime statistics
  |
  v
STATISTICS                                       -> additional statistics
  |
  v
VIEWER  (viewer options)                         -> data formatted for documents or viewing
  |
  v
STOP

    ##: demo-1.cfg - a demo configuration file for the @SATbed script
    START = REFGEN
    STOP = POST_PROCESS
    # directories for inputs and results
    BENCHMARK_DIR = /mnt/fileserver/SAT-EXP/benchmarks
    RESULTS_DIR = /mnt/fileserver/SAT-EXP/results
    # program id's and programs for each stage
    REFGEN = sched_medium_s
    sched_medium_s = sched_classic.tcl 3 7 yes  # sizes 3-7, satisfiable
    CLASSGEN = PC
    PC = rotate_formula_PC
    CLASS_SIZE = 10        # default = 32
    SEED = 33,1546,1968    # random seed (3 short ints, IEEE 48)
    SOLVER = chaff,unitwalk
    chaff = chaff_encap.tcl
    unitwalk = unitwalk_encap.tcl
    TIME_OUT = 1800        # stop after this many seconds
    SOLVE_CLASSES = sched_medium_s
    # Each program ID in the POST_PROCESS stage must have a name
    # of the form solver_pp, where solver is one of the solvers
    POST_PROCESS = chaff_pp,unitwalk_pp
    chaff_pp = chaff_postProcess.tcl
    unitwalk_pp = unitwalk_postProcess.tcl

Fig. 2. SATbed flow and an example of an experiment-specific configuration file.
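Configuration files such as demo-1.cfg are plain KEY = value lines with # comments; a minimal parser sketch (our own illustration, not the @SATbed script's code):

```python
def parse_config(text):
    """Parse 'KEY = value' lines, ignoring blank lines and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # strip trailing comments
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

cfg = parse_config("""
START = REFGEN
STOP = POST_PROCESS
CLASS_SIZE = 10   # default = 32
SOLVER = chaff,unitwalk
""")
print(cfg["CLASS_SIZE"], cfg["SOLVER"].split(","))  # 10 ['chaff', 'unitwalk']
```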


A SAT solver post-processor (SOLVER_PP) currently contains a solver-specific component that turns the 'raw' solver output into a standardized SATbed tabular format, solverID.tab. Fields such as instanceName, resolutionStatus, verificationStatus, timeout, runtime, and solutionString are required; optional fields such as flips, implications, and backtracks are solver-specific. Each instance is resolved as sat, unsat, or, due to timeout, unsolved. Currently, SATbed supports post-processors for the following SAT solvers: chaff, dp0, OpenSAT, QingTing, satire, sato, satoL, unitwalk, and walksat.

A generic verifier program extracts the reported solution (the last column in the tabular format) and, upon instance verification, finalizes the file solverID.tab: verificationStatus is reported as verSkip (if no solution is reported), verFail (if the solution does not verify), or verPass (if the solution verifies). This tabular file can be directly imported into any spreadsheet for analysis and plotting and is also available in html format for ready viewing with a web browser. Most importantly, the rows that contain unsolved and verFail entries in this tabular file support 'data censoring'³ for the SATbed utilities that generate various statistics, including distribution fitting, solvability functions, and box plots. By default, only runtime data is processed by the SATbed statistical package in this stage.

The user-invoked statistical applications (STATISTICS) stage allows the user to select any of the optional fields, such as flips, implications, or backtracks, and re-invoke the SATbed statistical package, thereby augmenting the runtime data statistics generated by default during the post-processing stage. Such additional statistics can be useful to identify platform-independent performance metrics common to a category of solvers. For example, we observed a near-perfect correlation of runtime versus implications for a number of DPLL-based solvers [5, 6].
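The data-censoring rule above (rows marked unsolved or verFail are excluded before statistics are computed) can be sketched over solverID.tab-like rows; the dict-based row layout below is an assumption for illustration, while the field names follow the text:

```python
def censor(rows):
    """Keep only rows eligible for statistics: solved and verified.
    Each row carries at least resolutionStatus and verificationStatus."""
    return [r for r in rows
            if r["resolutionStatus"] != "unsolved"
            and r["verificationStatus"] != "verFail"]

rows = [
    {"instanceName": "i0000", "resolutionStatus": "sat",
     "verificationStatus": "verPass", "runtime": 0.08},
    {"instanceName": "i0001", "resolutionStatus": "unsolved",
     "verificationStatus": "verSkip", "runtime": 1000.0},
    {"instanceName": "i0002", "resolutionStatus": "sat",
     "verificationStatus": "verFail", "runtime": 0.11},
]
print([r["instanceName"] for r in censor(rows)])  # ['i0000']
```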
Additional statistical packages, not part of the post-processing stage, can also be invoked to analyze data columns in solverID.tab.

Viewer applications (VIEWER) may range from web browsers to various experimental rendering programs or to commercial statistical packages such as JMP from SAS. Unlike the post-processing and statistics stages, such programs may access data about multiple solvers and/or multiple classes to create a single chart or table. Examples of such programs include the automated generation of LaTeX-formatted statistical reports, such as those presented in Table 1 and in [6], and the pairwise t-test reports discussed in [8].

Once the basic testbed is installed, the maintainer can add more reference instance generators, instance class generators, and SAT solvers. If a solver post-processor is not available, the maintainer will need to create one and add it to the testbed. As to the statistics and viewer applications, the possibilities are limitless.

A view of the SATbed architecture as a simple chain of its components is shown in Figure 2. Components targeted for execution in this chain are selected with the START and STOP options in the experiment-specific configuration file. Such self-documenting files are maintained by the user for each of the experimental

³ Data samples in rows that contain either an unsolved or a verFail entry are not included in SATbed-specific statistical reports.


designs that are to be executed. The name of this file is the only argument expected when invoking SATbed from the command line. Figure 2 shows some of the typical lines that configure the flow of the experiment before its invocation. For more details, see [17].

The architecture as we conceived it also allows substitution of back-end components by the maintainer. Taking advantage of parallel or distributed processing capabilities to execute several solvers or several class instances simultaneously is one application where this has clear advantages. We turn now to a family of benchmark instances whose refinement relies heavily on the SATbed approach.

4 Sched Class Generation

A collection of new benchmark problem instances whose difficulty we are investigating is based on a unit-length-task scheduling problem, defined as follows. Consider a set T of n tasks to be mapped to time slots 1, . . . , n on a single processor, and let S(t) denote the slot to which task t is assigned by the schedule (mapping) S. There are two constraints on the problem. One is that each task t has a deadline d(t) that imposes the condition S(t) ≤ d(t). The other is a precedence graph G = (T, P) with tasks as nodes, where a directed edge tu ∈ P means that S(t) < S(u). We refer to the problem of determining whether a feasible schedule exists as 1-PCS (1-processor precedence-constrained scheduling).

The left picture in Figure 3 shows a top-down view of the precedence graph we are currently considering, and the right picture shows the graph for sched03, one example among a whole class we call recursive N-graphs. The right picture also shows the recursive part of the definition of these graphs; for larger instances, each node of one of the smaller N-graphs could, in turn, be another N-graph. In the example shown, the tasks labeled cc would have to be scheduled, from top to bottom, in slots 1–4 to meet the deadline (4) of the bottom-left task. These would have to be followed by the three sc tasks (deadline of 7), then the cs (11) and ss (13).

To turn an arbitrary 1-PCS instance into an instance of SAT, we introduce variables s_{i,j} for 1 ≤ i, j ≤ n with the interpretation that s_{i,j} is true whenever task i is scheduled in slot j. We need only three types of clauses.

1. Each task must be scheduled by its deadline: clauses (s_{i,1} ∨ s_{i,2} ∨ · · · ∨ s_{i,d(i)}) for 1 ≤ i ≤ n.
2. At most one task can be scheduled in any slot: clauses (¬s_{i,k} ∨ ¬s_{j,k}) for 1 ≤ k ≤ n and for each pair i, j with k ≤ d(i), d(j).
3. Precedence constraints must be observed: clauses (¬s_{t,i} ∨ s_{u,i+1} ∨ · · · ∨ s_{u,d(u)}) for each tu ∈ P and for 1 ≤ i ≤ d(t).
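The three clause types above can be sketched as a generator of DIMACS-style clauses, where a negative integer denotes a negated literal. The variable numbering s_{i,j} → (i−1)·n + j and the function name are our own choices for illustration, not the actual SATbed instance generators.

```python
# A minimal sketch of the 1-PCS-to-SAT encoding described in the text.
# Variable s_{i,j} ("task i in slot j") is numbered (i-1)*n + j; a
# negative integer is a negated literal, as in the DIMACS CNF format.
from itertools import combinations

def encode_1pcs(n, deadline, edges):
    """deadline: dict task -> d(task); edges: list of (t, u), t before u."""
    var = lambda i, j: (i - 1) * n + j
    clauses = []
    # 1. each task i is scheduled in some slot no later than d(i)
    for i in range(1, n + 1):
        clauses.append([var(i, j) for j in range(1, deadline[i] + 1)])
    # 2. at most one task per slot k, over tasks eligible for slot k
    for k in range(1, n + 1):
        eligible = [i for i in range(1, n + 1) if deadline[i] >= k]
        for i, j in combinations(eligible, 2):
            clauses.append([-var(i, k), -var(j, k)])
    # 3. if t occupies slot i, its successor u occupies a later slot
    for t, u in edges:
        for i in range(1, deadline[t] + 1):
            clauses.append([-var(t, i)] +
                           [var(u, j) for j in range(i + 1, deadline[u] + 1)])
    return clauses

# small demo: 2 tasks, both with deadline 2, task 1 before task 2
print(len(encode_1pcs(2, {1: 2, 2: 2}, [(1, 2)])))  # 6 clauses
```

Note that clause type 3 includes the unit clause (¬s_{t,d(t)+…}) whenever a predecessor is forced out of its last slots, which is exactly the source of the unit clauses discussed later in connection with deadline adjustments.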
Everything said so far assumes we are generating satisfiable instances of scheduling. An unsatisfiable instance is created if the latest deadline is decreased by 1. The scheduling instances in our current experiments are named schednns (satisfiable) or schednnu (unsatisfiable), where nn is a two-digit number specifying the size of the precedence graph. The instance sched00 (both s and u) uses

The figure on the left shows a top-level view of a recursive N-graph. Four nodes, N0, N1, N2, and N3, are joined via three directed edges, N0 N1, N0 N3, and N2 N3. The left path, N0 N1, is the critical path (C), and N2 N3 the secondary path (S). As illustrated on the right, each of the nodes can either contain another recursive N-graph (as is the case with N0, N1, and N2) or be a single node (N3). In general, an edge between two nodes, such as the edge N0 N1, results in three edges between the nodes inside N0 and N1. Two of these connect the last node on the critical path inside N0 with the first nodes of both paths inside N1; the third connects the last node of the secondary path inside N0 with the last node of the secondary path inside N1. The edges N0 N3 and N2 N3 each yield only one edge because N3 only has one node, which becomes its critical path by default. For scheduling instances we identify four distinct paths through an N-graph having more than one layer (that is, one in which not all nodes are singletons): cc (critical-critical) follows the critical path at the top level and critical paths within all nodes encountered; sc (secondary-critical) follows the secondary path at the top level and critical paths in the nodes encountered; cs and ss are analogous.


Fig. 3. The recursive N-graph used for a satisfiable sched03 instance.

a simple 4-node, 3-edge N-graph. In general, if nn = 4k + r, where 1 ≤ r ≤ 4, then schednn uses a recursive N-graph in which nodes Ni contain the graph for size 4k when i < r and the graph of size 4(k − 1) (or a single node if k = 0) when i ≥ r.

The 1-PCS problem is not NP-complete; in fact, PCS does not become NP-complete until the number of processors is three [18, p. 239]. It is easy to generalize the second set of clauses to deal with any fixed number of processors. With three processors, for example, the clauses would take the form (¬s_{g,k} ∨ ¬s_{h,k} ∨ ¬s_{i,k} ∨ ¬s_{j,k}) for 1 ≤ k ≤ n and for all quadruples g, h, i, j with k ≤ d(g), d(h), d(i), d(j).

We anticipate further experiments with different types of precedence graphs (the recursive construction for the N-graphs can easily be generalized to any graph that has two paths), more processors (the simplest way to create multiprocessor instances is to make parallel copies of the precedence graphs, or to use multiple different graphs with the same number of nodes), and/or variations in deadline setting. Such experiments are greatly facilitated with our testbed.
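The generalization of the second clause set to p processors amounts to one clause per (p+1)-subset of the tasks eligible for a slot. The following sketch, with our own helper name and the same illustrative variable numbering as before, is one way to write it:

```python
# Sketch of the multiprocessor generalization: with p processors,
# "at most p tasks per slot" is enforced by one all-negative clause
# for every (p+1)-subset of the tasks eligible for that slot.
from itertools import combinations

def at_most_p_clauses(n, deadline, p):
    var = lambda i, j: (i - 1) * n + j   # illustrative numbering
    clauses = []
    for k in range(1, n + 1):
        eligible = [i for i in range(1, n + 1) if deadline[i] >= k]
        for subset in combinations(eligible, p + 1):
            clauses.append([-var(i, k) for i in subset])
    return clauses

# demo: 3 tasks, 2 processors -> one clause per slot ruling out
# all three tasks sharing that slot
print(len(at_most_p_clauses(3, {1: 3, 2: 3, 3: 3}, 2)))  # 3
```

The number of such clauses grows as the binomial coefficient C(n, p+1) per slot in the worst case, which is one reason the encoding is kept to a fixed, small p.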

In the original scheduling instances (discussed, e.g., in [5]) we made two adjustments to the deadlines before converting to SAT: (a) whenever a task t precedes a task u, we make d(t) ≤ d(u) − 1, if it isn’t already, and (b) the earliest deadline of a node with no successors is increased by 1. The purpose of adjustment (a) is to eliminate unnecessary variables; adjustment (b) avoids unit clauses caused by “tight” precedence constraints. Another set of scheduling instances was recently created without either of these adjustments; unit clauses were simply eliminated by propagation and the resulting instances had fewer variables (in fact, sizes 00–02 had no clauses remaining after unit propagation).

Table 1 illustrates the dramatic difference between these two types of instances, as well as the unusual rank ordering of the solvers.⁴ Whereas instances sched07s v01386 and sched07u v01384 are from our original benchmark set, instances sched04z07s v00655 and sched04z06u v00350 are from the new set. Despite the smaller number of variables, the new instances appear to be harder to solve, so much so that we were unable to use size 07 (with 649 variables) on the unsatisfiable side: even size 06 caused timeouts in almost every case, while size 05, with 111 variables, was too trivial, with runtimes around 0.01 seconds.

The original scheduling instances provide a much more significant challenge for chaff than for other solvers (except that walksat appears completely unable to find solutions in reasonable time for the satisfiable case). This is in contrast to chaff’s superior performance in other difficult problem domains, such as the blocks world [5, 6]. The erratic behavior of sato may be due to bugs in the implementation (some of these are noted by the author [14]), while satoL, a related solver with simpler data structures,⁵ has not only the fastest, but also the most predictable, execution times.
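Deadline adjustment (a) above can be implemented by relaxing deadlines across precedence edges until a fixed point is reached. The sketch below is our own illustration under that reading, not the generator’s actual code:

```python
# Sketch of deadline adjustment (a): tighten each deadline so that a
# predecessor's deadline is strictly below its successor's. Iterating
# over the edges until a fixed point avoids needing a topological order.
def tighten_deadlines(deadline, edges):
    """deadline: dict task -> d(task); edges: list of (t, u), t before u."""
    d = dict(deadline)
    changed = True
    while changed:
        changed = False
        for t, u in edges:
            if d[t] > d[u] - 1:
                d[t] = d[u] - 1
                changed = True
    return d

# demo: chain 1 -> 2 -> 3 with all deadlines 3
print(tighten_deadlines({1: 3, 2: 3, 3: 3}, [(1, 2), (2, 3)]))  # {1: 1, 2: 2, 3: 3}
```

With the tightened deadlines, variables s_{t,j} for slots a predecessor can never legally occupy are simply never generated.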
The tables are turned when we move to the new scheduling instances (655 variables and 8623 clauses for the size 07 satisfiable one, 350 variables and 3394 clauses for the size 06 unsatisfiable one). Tied for best solver in the satisfiable case is walksat, which was not even a contender with the original instance. The clear loser is satoL, the fastest solver for the original instance; it times out more often than not and runs for hundreds of seconds even when it doesn’t. As already pointed out, the relatively small unsatisfiable new scheduling instance appears to be too challenging (within our time-out limit of 1000 seconds) for any of the solvers we tested. We will fill in more detail with longer runs (and/or faster processors) for the final version of this paper.

A key point about our work with the scheduling instances is our ability to use SATbed to quickly create and report experimental results for a whole family of related instances of different sizes. A family of scheduling instances might consist of the (PC classes of the) five instances sched04s–sched08s. These range from 140/892 variables/clauses to 2088/47080. The smallest is suitable for detailed tracing while the largest is challenging for even the fastest solvers. Five size levels are also enough to observe important asymptotic trends; e.g., a solver that is not competitive on the given instances may have an execution time that grows significantly more slowly than that of other solvers, an observation that would make it a solver of choice given either a faster machine or a faster implementation.

⁴ Runtime is on a 350 MHz PowerPC (Apple Macintosh G3) with 384 MB memory.
⁵ This is sato run with the -d1 switch.


The development of the new, harder scheduling instances went through four revisions before arriving at the current level of challenge. This process would have been nearly impossible to manage without SATbed.

5 Current Postings on the Web

A well-defined schema is required to manage large volumes of input data sets and repeated executions of several solvers, each writing results in solver-specific formats. Two components of the experimental design schema (EDS) that evolved

Table 1. SAT solver comparisons on satisfiable and unsatisfiable instances in PC classes of two sched families.

costID = timeToSolve (seconds) – sat instances

Class labels: name=sched07s v01386, type=PC, size=32
solverID     initV   minV   meanV  maxV   max/min  stdev  distribution
satoL        0.08    0.04   0.07   0.13   3.25     0.02   near-normal
unitwalk     0.06    0.06   0.41   2.33   38.8     0.47   exponential
sato         0.11    0.09   1.28   37.1   412      6.54   heavy-tailed
chaff        16.7    1.49   11.0   36.3   24.4     7.49   exponential
walksat(1)   1000    1000   1000   1000   1.00     0.00   incomplete

Class labels: name=sched04z07s v00655, type=PC, size=32
solverID     initV   minV   meanV  maxV   max/min  stdev  distribution
sato(2)      0.31    0.04   0.60   2.31   57.8     0.64   exponential
walksat      2.27    0.03   0.82   2.34   78.0     0.63   exponential
unitwalk     1.86    0.05   2.37   12.2   244      2.62   exponential
chaff        1.98    0.20   30.2   165    825      43.1   exponential

costID = timeToSolve (seconds) – unsat instances

Class labels: name=sched07u v01384, type=PC, size=32
solverID     initV   minV   meanV  maxV   max/min  stdev  distribution
satoL        0.18    0.18   0.28   0.31   1.72     0.02   near-normal
sato(3)      0.18    0.18   0.24   0.28   1.56     0.02   incomplete
chaff        5.68    1.63   9.13   27.8   17.1     6.27   exponential

Class labels: name=sched04z06u v00350, type=PC, size=8
solverID     initV   minV   meanV  maxV   max/min  stdev  distribution
chaff(4)     1000    1000   1000   1000   1.00     0.00   NA
sato(4)      1000    1000   1000   1000   1.00     0.00   NA
satoL(4)     1000    1000   1000   1000   1.00     0.00   NA

(1) All attempted instances timed out at 1000 seconds, but some runs are still pending.
(2) One instance timed out at 1000 seconds (possibly a bug in the solver); the statistics are for the remaining 32.
(3) Three instances timed out at 1000 seconds (possibly a bug in the solver); the statistics are for the remaining 30.
(4) All attempted instances timed out at 1000 seconds.


benchm SATcnf
  + bw large s
  + bw large u
  ...
  + queen medium
  - queen small
    - @references
        queen04 v00016.cnf  queen04 v00025.cnf  queen04 v00036.cnf  ...
    - queen04 v00016
      - queen04 v00016 C
          i0000.cnf  i0001.cnf  ...
      + queen04 v00016 I
      + queen04 v00016 P
      + queen04 v00016 PC
    + queen04 v00025
    + queen04 v00036
  ...
  + sched s
  + sched u
  ...

results SATcnf
  + chaff
  + dp0 nat
  ...
  + satire
  - sato
    + bw large s
    ...
    + queen medium
    - queen small
      - @references
          sato.raw  sato.html  ...
      - queen04 v00016
        - queen04 v00016 C
            sato.raw  sato.html  ...
        + queen04 v00016 P
        + queen04 v00016 PC
      + queen04 v00025
      ...
    + sched s
    ...
  + satoL
  ...

Fig. 4. Organization of benchmark classes and experimental results in SATbed.

in this work are shown in Figure 4: benchm SATcnf archives all input data sets, and results SATcnf archives the results of every experiment generated by each solver. As discussed in Section 3, the initial results of experiments are in solver-specific (.raw) formats and are post-processed into several formats and posted on the Web. An example of such postings is shown in Figure 5: three solvers, dp0 nat, chaff, and sato, have been applied to class instances from sched03s v0095 (introduced in this paper). Note that distributions of reported variables range from exponential to normal and heavy-tailed, as well as impulse (0 backtracks are reported by sato for all instances under test). Since dp0 nat is executed under a Tcl interpreter (and is relatively slow compared to programs in C), the underlying solvability graph displays the number of implications required by each solver. The crossover of dp0 nat and chaff makes this benchmark class (and others in the sched family) of particular interest for further study.

To access the compressed archives of input data sets under benchm SATcnf; the raw results, statistical summaries, and tabulated data under results SATcnf; and SATbed itself, with complete documentation and running examples, the reader is invited to follow the links posted on the home page http://www.cbl.ncsu.edu/OpenExperiments/SAT/
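A solvability function of the kind plotted in Figure 5 is simply the fraction of class instances a solver resolves within a given cost budget, with timed-out runs censored. A minimal sketch (the function name and sample values are ours, for illustration):

```python
# Sketch of an empirical solvability function: the fraction of class
# instances resolved within budget t, with runs that hit the timeout
# treated as censored (never counted as solved).
def solvability(runtimes, timeout, t):
    solved = [r for r in runtimes if r < timeout and r <= t]
    return len(solved) / len(runtimes)

runs = [0.07, 0.13, 0.41, 1000.0]      # last run hit the 1000 s timeout
print(solvability(runs, 1000.0, 0.5))  # 3 of 4 instances solved by t = 0.5
```

Plotting this fraction against t (or against a platform-independent cost such as implications) yields the solvability curves whose crossovers are discussed above.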

(a) dp0 nat.html file (table of samples) extracted from dp0 nat.raw file
(b) dp0 nat stat.html file (table of statistics) extracted from dp0 nat.html
(c) chaff stat.html file (table of statistics) extracted from chaff.html
(d) sato stat.html file (table of statistics) extracted from sato.html
(e) solvability functions computed on samples from the three solvers above

Fig. 5. Experimental results reported by three SAT solvers for instances from the class of sched03s v0095 PC. Tables are posted on the Web by SATbed as experiments unfold.

6 Conclusions

In order for the equivalence-class methodology proposed in [5, 6] to gain wide acceptance, the generation of equivalence classes, the execution of multiple solvers on these classes, and the extraction of meaningful data must be simplified as much as possible. The SATbed software described here is, we believe, an important step in this direction. Once it is installed, users will be able to experiment readily with new solvers, and do so with reliability and statistical significance. They will be able to fairly compare solvers in previously hard-to-compare categories, such as stochastic versus deterministic. They will also be able to develop and refine new, more difficult benchmark instances and class-generation methodologies. Finally, large experiments on families of related instances will become more manageable than ever before.

Setting up, running, and analyzing data for the required experiments on a family of instances can be a daunting task, error-prone if done manually. Not only does SATbed make gathering data for a family of instance classes manageable, it also allows researchers to gain insights from variations of the instance characteristics. The process of creating and refining “hard” instances of SAT, if done via careful experimentation on instance classes and families, yields important insights that cannot be gained from the development of better solvers alone, but could lead to the latter in the long run.

We invite researchers to check out the tutorial about benchmark equivalence classes, the results of our experiments, and a brief guide on using SATbed [17]. This testbed will allow researchers not only to replicate any of the web-posted experiments but also to contribute new class instances and share new experimental results, which again can be replicated by others.

Acknowledgments. The experiments reported in this paper could not have taken place without SAT solvers such as chaff, satire, sato, satoL, unitwalk, and walksat. We thank the authors for ready access to these software packages and their exceptional ease of installation. We also thank Dr. Jason Osborne from the Department of Statistics (NCSU) for advice on the likelihood ratio test statistic for exponential distributions.
We appreciate the reassurance that, in our experiments with 128 samples, the simpler t-test statistic has sufficient power to resolve the hypothesis of equal population means of two exponential distributions.

References

1. Laurent Simon. Sat-Ex: The experimentation web site around satisfiability, 2003. For more information, see http://www.lri.fr/~simon/satex/satex.php3.
2. Michael A. Trick. Second DIMACS challenge test problems. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 26:653–657, 1993. The SAT benchmark sets are available at ftp://dimacs.rutgers.edu/pub/challenge/satisfiability.
3. Henry Kautz, David McAllester, and Bart Selman. Encoding plans in propositional logic. In KR’96: Principles of Knowledge Representation and Reasoning, pages 374–384, 1996. The SATPLAN benchmark set is available at http://sat.inesc.pt/benchmarks/cnf/satplan/.
4. SATLIB – The Satisfiability Library, 2003. See http://www.satlib.org.
5. F. Brglez, X. Y. Li, and M. Stallmann. The Role of a Skeptic Agent in Testing and Benchmarking of SAT Algorithms. In Fifth International Symposium on the Theory and Applications of Satisfiability Testing, May 2002. Available at http://www.cbl.ncsu.edu/publications/.
6. F. Brglez, X. Y. Li, and M. Stallmann. On SAT Instance Classes and a Method for Reliable Performance Experiments with SAT Solvers. Annals of Mathematics and Artificial Intelligence (AMAI), Special Issue on Satisfiability Testing, 2003. Under review. Submitted to AMAI as the revision of the paper published at the Fifth International Symposium on the Theory and Applications of Satisfiability Testing, Cincinnati, Ohio, USA, May 2002. Available at http://www.cbl.ncsu.edu/publications/.
7. X. Y. Li, M. Stallmann, and F. Brglez. QingTing: A Fast SAT Solver Using Local Search and Efficient Unit Propagation. In Proceedings of SAT 2003, Sixth International Symposium on the Theory and Applications of Satisfiability Testing, S. Margherita Ligure – Portofino, Italy, May 5-8 2003. Available from http://www.cbl.ncsu.edu/publications/.
8. X. Y. Li, M. F. Stallmann, and F. Brglez. QingTing: A Local Search SAT Solver Using an Effective Switching Strategy and an Efficient Unit Propagation. Springer-Verlag Lecture Notes in Computer Science (LNCS), Special Issue on Satisfiability Testing, 2003. Under review. Submitted to LNCS as the revision of the paper published at the Sixth International Symposium on the Theory and Applications of Satisfiability Testing, S. Margherita Ligure – Portofino, Italy, May 5-8 2003. Available from http://www.cbl.ncsu.edu/publications/.
9. David A. McAllester, Bart Selman, and Henry A. Kautz. Evidence for invariants in local search. In AAAI/IAAI, pages 321–326, 1997.
10. E. Hirsch and A. Kojevnikov. UnitWalk: A new SAT solver that uses local search guided by unit clause elimination. PDMI preprint 9/2001, Steklov Institute of Mathematics at St. Petersburg, 2001.
11. Holger H. Hoos and Thomas Stützle. Evaluating Las Vegas Algorithms – Pitfalls and Remedies. In UAI-98, pages 238–245. Morgan Kaufmann Publishers, 1998.
12. Holger H. Hoos and Thomas Stützle. Local Search Algorithms for SAT: An Empirical Evaluation. Journal of Automated Reasoning, 24, 2000.
13. Matthew Moskewicz, Conor Madigan, Ying Zhao, Lintao Zhang, and Sharad Malik. Chaff: Engineering an efficient SAT solver. In IEEE/ACM Design Automation Conference (DAC), 2001. Version 1.0 of Chaff is available at http://www.ee.princeton.edu/~chaff/zchaff/zchaff.2001.2.17.src.tar.gz.
14. Hantao Zhang. SATO: An efficient propositional prover. In Conference on Automated Deduction, pages 272–275, 1997. Version 3.2 of SATO is available at ftp://cs.uiowa.edu/pub/hzhang/sato/sato.tar.gz.
15. F. Proschan. Theoretical explanation of observed decreasing failure rate. Technometrics, 5:375, 1963.
16. F. Jensen. Electronic Component Reliability: Fundamentals, Modelling, Evaluation, and Assurance. J. Wiley, 1996.
17. F. Brglez, M. Stallmann, and X. Y. Li. SATbed Home Page: A Tutorial, A User Guide, A Software Archive, Archives of SAT Instance Classes and Experimental Results, 2003. Available at http://www.cbl.ncsu.edu/OpenExperiments/SAT/.
18. M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
