2008 Sixth IEEE International Conference on Software Engineering and Formal Methods

Cheap and Small Counterexamples

Henri Hansen
Institute of Software Systems, Tampere University of Technology
PO Box 553, FI-33101 Tampere, FINLAND
[email protected]

Jaco Geldenhuys
Computer Science Division, Department of Mathematical Sciences
Stellenbosch University, Private Bag X1, 7602 Matieland, SOUTH AFRICA
[email protected]

Abstract



Minimal counterexamples are desirable, but expensive to compute. We propose four algorithms for computing small counterexamples that approximate the shortest case. Three of these use a new algorithm for automata-theoretic linear-time model checking, based on an early algorithm by Dijkstra for detecting strongly connected components. All four of the approximation algorithms rely on transition shuffling; we show that the default transition ordering can behave badly. The algorithms are compared to a standard shortest counterexample algorithm on a large public data set.


1. Introduction

In spite of decades of research, software testing remains a somewhat ad hoc activity [5, 7]. Formal methods, model checking in particular, are not widely used, but there are nevertheless several success stories that demonstrate their value [4, 29, 39, 45, 51]. This value derives not only from the fact that building a model of software can lead developers to a different and deeper understanding of the software, but also from the fact that, in the case of model checking, it can easily and cheaply produce counterexamples – which are essentially executions – that demonstrate potential errors in the software and/or deviations in the model. In this setting, the generation of short counterexamples that are readily translatable to actual test cases is a distinct advantage.

In short, model checking is an automatic technique that verifies that a formal model of a software/hardware system (encoded as a Kripke structure) satisfies a given property, expressed as a temporal logic formula (encoded as a Büchi automaton) [12]. The two encodings are specialized forms of automata. Their synchronous product is computed on-the-fly (i.e., as needed), and an exploration of the product automaton reveals whether the original model satisfies the correctness property. In the context of automata-theoretic linear-time model checking, a counterexample is a lasso-shaped path, as shown in Figure 1. The counterexample length is the sum of the number of transitions on the three paths: ℓ = |ρ1| + |ρ2| + |ρ3|.

Figure 1. A lasso-shaped counterexample

Longer counterexamples may in some cases convey more information to the user, but it is not a trivial task to distinguish useful from useless information [3, 32, 33, 49]. The survey by Fraser, Ammann, and Wotawa is an excellent introduction to the use of model checkers in testing [20]. All things being equal, shorter counterexamples are more desirable than longer ones.

The problem of minimal (in other words, shortest) counterexamples has been studied before. Gastin, Moro, and Zeitoun described a depth-first search algorithm with exponential worst-case time consumption [25]. Hansen and Kervinen [34] instead proposed an algorithm based on interleaved breadth-first searches with O(nm) runtime, where n is the number of states and m is the number of transitions. Unfortunately, it requires the backward exploration of transitions, making it unsuited for on-the-fly verification. Recently, Gastin and Moro presented a forward-only algorithm that uses priority queues, and adds only a logarithmic coefficient to its otherwise similar time consumption [24]. In this paper we examine the possibility of finding short

978-0-7695-3437-4/08 $25.00 © 2008 IEEE. DOI 10.1109/SEFM.2008.18

counterexamples (with small ℓ values) without too much extra work. Some earlier research has also addressed this issue: directed model checking [19, 40] makes use of heuristics, the A* algorithm in particular. The idea of using iterative deepening has also been explored in the context of directed model checking [36], and we shall make use of a similar idea. Shortest counterexamples have also been studied in other contexts, such as symbolic model checking [11, 37, 41, 44].

Many approaches aim at reducing the work required to explore the product automaton. Those that focus on the core algorithm are of particular relevance to the ideas presented in this paper. This includes our own work on testing automata [27] and work that deals with generalized Büchi automata [15, 47].

The rest of this paper is organized as follows. In Section 2 we introduce a model checking algorithm based on Dijkstra's algorithm for maximal strongly connected components. Section 3 describes the various algorithms in more detail, and in Section 4 we empirically investigate their performance in terms of counterexample size and transitions explored. Section 5 presents our conclusions and outlines future research.

DIJKSTRA(init)
 1   stack ← ({init}, GETTRANS(init))
 2   while ¬stack.ISEMPTY do
 3       (S, T) ← stack.TOP
 4       if T ≠ ∅ then
 5           t ← remove any from T (and update T on stack)
 6           d ← destination state of t
 7           if d ∈ S for some (S, T) ∈ stack then
 8               S′ ← T′ ← ∅
 9               while d ∉ S′ do
10                   (S″, T″) ← stack.POP
11                   if S″ = {s} ∧ ISACC(s) then report error
12                   S′ ← S′ ∪ S″
13                   T′ ← T′ ∪ T″
14               stack.PUSH(S′, T′)
15           else if ¬ISMARKED(d) then
16               stack.PUSH({d}, GETTRANS(d))
17       else
18           stack.POP
19           foreach s ∈ S do MARK(s)

Figure 2. The Dijkstra algorithm for detecting accepting cycles

2. Dijkstra's SCC algorithm

Algorithms for explicit-state on-the-fly model checking are usually based on either nested depth-first search (such as Courcoubetis et al. and its variations [13, 35, 42]) or the detection of strongly connected components (SCCs). There are four main algorithms for SCC detection: Kosaraju's algorithm [43], Tarjan's algorithm [46], the Cheriyan/Mehlhorn/Gabow path algorithm [10, 22], and Dijkstra's algorithm [16]. We mention Kosaraju first because it is the least like the others. Unfortunately, it is not suitable for on-the-fly model checking because of its need to explore transitions backwards. The path-based algorithm is also not considered because it is essentially Tarjan's algorithm with more sophisticated data structures.

Dijkstra's algorithm for SCCs is probably the least known, but it has an attractive property: it does not require a pure depth-first search. Pseudocode for a model checking algorithm based on Dijkstra's algorithm is shown in Figure 2. It is identical to Dijkstra's original algorithm except for the inclusion of line 11. In this code, stack is a last-in-first-out stack where each entry is a tuple consisting of a set of states and a set of transitions. The state sets are the partial SCCs that the algorithm has recognized so far. The GETTRANS function returns all the transitions of a given state, and ISACC returns a state's accepting status. It is not difficult to show the correctness of the algorithm; we shall not do so here, but the basic approach taken in [28] can be adopted.
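As an illustration, the pseudocode of Figure 2 can be transcribed into executable form. The following Python sketch is ours, not the authors' implementation; it assumes the product automaton is given as a successor function (here called get_trans) and an acceptance predicate (is_acc), both illustrative names:

```python
def dijkstra_check(init, get_trans, is_acc):
    """Accepting-cycle detection following Figure 2 (a sketch, not the
    authors' code). get_trans(s) yields (src, dst) transitions of s;
    is_acc(s) reports the accepting status of s."""
    marked = set()
    # Each stack entry: (partial SCC as a set of states, unexplored transitions).
    stack = [({init}, list(get_trans(init)))]
    while stack:
        S, T = stack[-1]
        if T:
            _src, d = T.pop()                      # remove any transition from T
            if any(d in S2 for S2, _ in stack):    # d closes a cycle on the stack
                S1, T1 = set(), []
                while d not in S1:                 # merge entries into one SCC
                    S2, T2 = stack.pop()
                    if len(S2) == 1 and is_acc(next(iter(S2))):
                        return True                # line 11: report error
                    S1 |= S2
                    T1 += T2
                stack.append((S1, T1))
            elif d not in marked:
                stack.append(({d}, list(get_trans(d))))
            # else: d already fully explored; discard the transition
        else:
            stack.pop()
            for s in S:                            # SCC fully explored
                marked.add(s)
    return False

# Usage: a three-state automaton with an accepting cycle through state 1.
succ = {0: [1], 1: [2], 2: [1]}
trans = lambda s: [(s, t) for t in succ[s]]
print(dijkstra_check(0, trans, lambda s: s == 1))  # accepting cycle found
```

Note that mutating the transition list in place corresponds to "update T on stack" in line 5 of the pseudocode.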

The Dijkstra and Tarjan algorithms appear similar on the surface, but differ in one significant regard: the way in which the next transition to explore is selected (line 5 of Figure 2). In Tarjan's algorithm, a transition is chosen from among the unexplored transitions of the deepest state, i.e., the state most recently added to the stack. A weaker restriction applies in Dijkstra's algorithm: the next transition is chosen from among the unexplored transitions of the deepest SCC. In other words, the choice of transition is much broader when the most recent SCC contains more than one state, and the Dijkstra algorithm includes the Tarjan algorithm as a special case. One can, as we shall describe shortly, choose a transition from the shallowest state of the SCC. Here the depth of a state is defined in the usual manner: the initial state has depth 0, and whenever a new state is explored, its depth is one greater than the depth of its parent.

The intent of Figure 2 is to present the code compactly and legibly. Efficient implementations are possible, and some of the mechanisms required have already been described elsewhere [14, 28]. Couvreur, in fact, reinvented most of the algorithm, but without investigating non-depth-first search. This is, therefore, the first time that the algorithm has been presented explicitly and also the first time that the non-depth-first search feature is exploited.
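To make the shallow-first choice concrete, here is one possible selection helper (a hypothetical sketch; the function and parameter names are ours): given the unexplored transitions of the top SCC entry and a table of depth-first numbers, it picks a transition leaving the state with the smallest number.

```python
def pick_shallow_first(T, dfnum):
    """Pick (and remove) from T a transition whose source state has the
    smallest depth-first number. T is a list of (src, dst) pairs; dfnum
    maps states to their DFS numbers. Illustrative only."""
    i = min(range(len(T)), key=lambda k: dfnum[T[k][0]])
    return T.pop(i)

# Usage: among transitions leaving states 3, 1 and 2, the transition
# leaving state 1 (smallest DFS number) is selected.
T = [(3, 'a'), (1, 'b'), (2, 'c')]
print(pick_shallow_first(T, {1: 1, 2: 2, 3: 3}))
```

Plugging such a policy into line 5 of Figure 2 yields the shallow-first variant; choosing from the deepest state instead recovers Tarjan-style behaviour.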



3. Algorithms for Short Counterexamples

 

The shortest counterexample algorithms mentioned in the introduction are expensive, and prohibitively so in the case of large models. The Gastin/Moro/Zeitoun algorithm re-explores states when it discovers that they can be reached by shorter paths and its time consumption can be exponential in terms of the number of states and transitions [25]. Even though the breadth-first algorithms proposed in the literature [24, 34] have a worst-case running time that is only cubic in the number of states, they too are often impractically slow and may explore up to Θ(nm) transitions before finding a single counterexample. This raises the question of whether it is possible to produce short counterexamples – but not necessarily the shortest – while avoiding the high overhead associated with the above algorithms. In this section we describe four algorithms for this task, and we refer to them as approximate algorithms. (We are using the term approximation informally and not in its computational complexity sense.) We attack the problem in three ways: by shuffling the transitions, abandoning pure depth-first search order, and limiting the search depth.
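The first of these attacks, transition shuffling with randomized restarts, amounts to a simple harness around any search algorithm. The sketch below is our own illustration; check_with_order and the per-run seeding scheme are assumptions, not the paper's code:

```python
import random

def shortest_by_shuffling(check_with_order, runs, base_seed=0):
    """Run a search repeatedly with different random seeds and keep the
    shortest counterexample found. check_with_order(rng) performs one
    search, using rng (e.g. via rng.shuffle) to randomize each state's
    transition order, and returns a counterexample (a list) or None."""
    best = None
    for i in range(runs):
        rng = random.Random(base_seed + i)   # fresh seed per run
        cex = check_with_order(rng)
        if cex is not None and (best is None or len(cex) < len(best)):
            best = cex
    return best

# Usage with a stand-in search that returns counterexamples of varying length:
fake = lambda rng: list(range(rng.randint(3, 9)))
print(len(shortest_by_shuffling(fake, 20)))
```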
3.1. Transition Shuffling

Most model checker implementations are deterministic in the sense that they produce exactly the same result when run more than once. This is not true of the underlying algorithms: the order in which the transitions of a state are explored is not fixed by any of the well-known algorithms. Our first line of attack is to switch from deterministic to randomized algorithms by shuffling the order in which transitions are explored. When applied to the nested search algorithm of Schwoon and Esparza [42], the result is an implementation we call min se. It runs the basic algorithm a specified number of times with different random seeds and records the shortest counterexample encountered.

3.2. Non-depth-first Search

Standard SCC-based and nested search algorithms explore the product automaton in a depth-first manner, even if the latter implicitly duplicates the search space. By employing the Dijkstra algorithm described in the previous section, we can break from pure depth-first ordering. When the transitions of a non-deepest state in the top SCC are expanded first, the result is, generally speaking, not a depth-first search tree. This means that the algorithm is able to investigate search trees that are not available to any of the traditional algorithms.

Choosing transitions from the shallowest state of the top SCC is a heuristic strategy aimed at exploring shorter paths than depth-first search. Here, shallowest means the state with the smallest depth-first number, not necessarily the state that is closest to the initial state. We implemented this strategy, combined with transition shuffling, and call it min d.

3.3. Depth-limited Search

Our third strategy is to employ iterative deepening: a strict depth limit is applied to the search, and in each successive iteration the limit is relaxed until a counterexample is found. Note, however, that this does not guarantee that a shortest counterexample will be found. Consider, for instance, the graph in Figure 3. If the leftmost transition of the initial state is explored first, a depth-first-based algorithm will miss the shorter counterexample, because the state marked with x will be fully expanded by the time it is reached via the rightmost transition.

Figure 3. "Hard" case for iterative deepening

However, iterative deepening can be used for approximation purposes, and, when random shuffling is used, it has the possibility of finding shorter paths when run again. Therefore, we implemented it in the following way:

1   obtain counterexample of length c
2   d ← 1; n ← 0
3   while n < N do
4       if d = c then n ← n + 1; d ← 1
5       run limited search with depth limit d
6       if counterexample then c ← d; d ← 1
7       else d ← d + 1

The idea is that once the depth limit reaches the length of the current counterexample, we start again from scratch. Our approach sets a limit N on the total number of resets, but exactly how the resets are limited is really a matter of taste.
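Both depth-limited strategies of this section fit a single driver; only the way the limit is reset and grown differs. The following Python sketch is our own illustration (search, reset, and grow are our names, not the paper's), instantiated for iterative deepening and for the binary variant described later in this section:

```python
from math import ceil

def deepening(search, c0, N, reset, grow):
    """Generic depth-limited driver. search(d) returns True if a
    counterexample of length <= d is found; c0 is the length of an
    initial counterexample; N bounds the number of resets."""
    c, d, n = c0, reset(c0), 0
    while n < N:
        if d == c:                    # limit reached the current best: reset
            n, d = n + 1, reset(c)
        if search(d):
            c, d = d, reset(d)        # shorter counterexample found
        else:
            d = grow(c, d)            # relax the limit
    return c

# Iterative deepening (cf. ite d): restart at depth 1, grow by one.
ite = lambda s, c0, N: deepening(s, c0, N, lambda c: 1, lambda c, d: d + 1)

# Binary deepening (cf. bin d): halve the limit, then bisect toward c.
bin_d = lambda s, c0, N: deepening(s, c0, N,
                                   lambda c: ceil(c / 2),
                                   lambda c, d: ceil((c + d) / 2))

# Usage: if the true shortest counterexample has length 7, both converge to 7.
exists = lambda d: d >= 7
print(ite(exists, 20, 3), bin_d(exists, 20, 3))
```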

We implemented this in conjunction with the shallow-first version of Dijkstra's algorithm and call it ite d.

We also investigated another kind of depth-limited search. After an initial counterexample has been detected, the depth limit is set to half its length. The algorithm is run again, and if a new counterexample is found, the limit is again halved. If no new counterexample is detected, the limit is set to the average of the previous limit and the shortest counterexample found so far. The following lines encode this procedure:

1   obtain counterexample of length c
2   d ← ⌈c/2⌉; n ← 0
3   while n < N do
4       if d = c then n ← n + 1; d ← ⌈c/2⌉
5       run limited search with depth limit d
6       if counterexample then c ← d; d ← ⌈c/2⌉
7       else d ← ⌈(c + d)/2⌉

In other words, the algorithm performs a kind of binary search for a short counterexample. Because of the counterexample-missing problem described above, the whole procedure is repeated a number of times. One benefit of binary deepening over iterative deepening is that the former never incurs more than a logarithmic extra cost. We have combined this with transition shuffling and the Dijkstra algorithm in an implementation called bin d.

3.4. Summary

We have now described four approximate algorithms (min se, min d, ite d, and bin d) that we believe improve on previous, exact algorithms. The question of exactly how well they perform in practice is addressed in the next section. There are, however, two important remaining observations to make.

Firstly, both the exact and approximate algorithms can provide feedback to the user to indicate the size of the shortest counterexample found thus far. This allows the user to interrupt the algorithm when the counterexample grows "short enough". Moreover, the user can easily specify limits on the number of states and transitions the algorithm is allowed to explore, or, more directly, on the time and memory it is allowed to consume.

Secondly, it is unclear whether partial order reduction methods [30, 38, 48] can be applied to any of the exact algorithms; breadth-first search presents special challenges in this regard [1, 9, 36] and it is not even certain that such reduction would preserve the correctness of the algorithms. On the other hand, all of the approximate algorithms we have described here are amenable to partial order reduction and its variations.

4. Comparison

In addition to the four algorithms described above, we have also implemented gm, the Gastin/Moro algorithm described in [24]. The algorithms are applied to the 196 of the 227 models in the BEEM repository [6] that produce counterexamples. These models include Büchi automata generated with the LTL2BA program [26]. Many of the models are parameterized versions of others; our data set contains 26 different "base" models. (A further 36 BEEM base models do not contain counterexamples.) The models range in size from 107 to 11.2 million states, and from 299 to 27.5 million transitions. The distribution of models in terms of states and transitions is shown in Figure 4. The slope of the line between a point and the origin gives the density (transitions per state) of the model represented by the point. The solid line indicates a density of 1, while the dotted line indicates the median density of 3. Densities range from 1.1 to 12.8, with an average of 3.7. Furthermore, the state spaces we generate correspond exactly to those reported in the BEEM repository.

Figure 4. Distribution of models employed

In the results that follow, the randomized algorithms were run with three different repetition parameters: 10, 50, and 100 for min se and min d, and 2, 5, and 10 for ite d and bin d. Each of these twelve variations was run nine times with different random seeds. The exact gm algorithm is deterministic and was therefore run only once, seeded with the smallest counterexample found by the approximate algorithms. For comparison, we also ran two algorithms with transitions explored in the order of declaration: se (the Schwoon/Esparza nested search algorithm [42]) and d (the shallow-first Dijkstra algorithm from Section 2).

The first step in assessing the results is to classify the models and counterexamples as "easy", "moderate" or "hard". Table 1 summarizes how well a particular subset of the approximate algorithms fares. As the last row of


the table shows, in 10.71% of cases, all nine runs of all of the approximate algorithms find a shortest counterexample. These cases include some short counterexamples (about one third contain ≤ 5 states), but also longer ones (including 4 counterexamples of length 260). For 96.43% of the models at least one of the approximate algorithms finds a shortest counterexample, and only in 3.57% of cases do all the algorithms fail to do so. The table also reveals the relative strengths of the algorithms: min d is slightly more adept at finding short counterexamples than min se, but both are clearly outperformed by ite d and bin d. Of the latter two, ite d is the more accurate.

Table 1. Success rates for approx. algorithms

                Minimal counterexample found by...
Algorithm       ...all      ...one or more    ...none
min se.10       12.24%      40.82%            59.18%
min se.50       27.55%      47.45%            52.55%
min se.100      32.65%      55.10%            44.90%
min d.10        12.76%      41.33%            58.67%
min d.50        28.57%      47.96%            52.04%
min d.100       35.20%      56.12%            43.88%
ite d.2         77.55%      93.88%             6.12%
ite d.5         82.65%      95.41%             4.59%
ite d.10        87.76%      95.41%             4.59%
bin d.2         67.35%      92.86%             7.14%
bin d.5         75.51%      91.84%             8.16%
bin d.10        84.69%      92.86%             7.14%
All             10.71%      96.43%             3.57%

Before we consider the cost of the algorithms, it is reasonable to ask how badly the algorithms fail when they do. The top half of Table 2 shows some examples of the worst performances of each algorithm and repetition count for various models. (The model names, shown at the top of the table, correspond to BEEM names.) The first row, marked "Minimal length", is the size of the shortest counterexample, while the numbers in the other rows are expressed as ratios with respect to this size. For example, the shortest counterexample found by min se.10 for the lifts.4.2 model is 13.76 times the length of the shortest counterexample (which has length 10). The largest ratio in each row is indicated in bold, and the average ratio for each algorithm over all of the models appears in the last column. As before, min se and min d fare worse than ite d and bin d. (For gm, the entries are all 1.00, of course.) In the worst cases for the latter algorithms (bakery.2.4 and protocols.1.4, respectively) the ratio is relatively small and quite close. Interestingly, even in the worst case shown here, ite d outperforms the other algorithms on the bakery.2.4 model. The worst cases for se are clearly worse than for d. The average counterexample produced by se is 9.36 times longer than minimal, while for d, the number is just 2.32. Interestingly, d seems to benefit much less from transition shuffling than se. In some cases the default order is actually much better than a shuffled order (szymanski.4.3), but it is never better than ite d or bin d.

Although it is common practice to report bytes and seconds consumed, we choose not to do so. We believe that the number of states stored and the number of transitions explored give a more accurate indication of the memory and time consumption, independently of a particular data structure and of the processor architecture and clock speed. For example, because of the large number of experiments needed, the state spaces for the models are pre-generated and each individual state is encoded as a single integer. However, the reported results would look the same if the algorithms were implemented in a production-level model checker such as SPIN.

The bottom half of Table 2 relates the cost, in terms of transitions explored, of the algorithms when run on the same models listed in the top half. The row marked "Transitions" gives the number of transitions in each of the models; the models are sorted in ascending number of transitions, and the other numbers are ratios with respect to this number. This ratio is, in many instances, greater than 1 because the algorithms re-explore some transitions. We should now point out that the models in this table were selected to include the worst-case ratio for each of the algorithms. For comparison, two models were added: szymanski.4.3 in the second-last column is one of the largest models in the data set (with more than 25 million transitions), and iprotocols.3.4 in the fourth-last column is one of the 3.57% of models for which none of the algorithms find the shortest counterexample (column "none" of Table 1). Although there is some variation in the ratios, it is not surprising that the worst-case models coincide. From the table it appears that bakery.2.4 and bakery.1.2 are examples of "difficult" models for all of the algorithms, whereas lifts.4.2 represents a relatively trivial model. While extinction.4.2 is challenging for min se and min d, they explore only a small number of transitions of the elevators2.1.5 model compared to ite d and bin d. The extra precision of ite d comes at a relatively high cost. In the average case, ite d finds counterexamples that are only about one percent longer than the shortest, but so does bin d. On the other hand, ite d is on average more than four times as costly as bin d; in the worst cases, this coefficient is close to ten.

Table 2. Algorithm performance as ratio w.r.t. shortest counterexample/number of transitions

[Table 2: columns are models sorted by ascending transition count (among them bakery.2.4, bakery.1.2, protocols.1.4, lifts.4.2, extinction.4.2, elevators2.1.5, iprotocols.3.4, and szymanski.4.3), with an Average column last. The top half gives each model's minimal counterexample length and the length ratios achieved by se, d, min se, min d, ite d, and bin d; the bottom half gives each model's transition count and the corresponding transitions-explored ratios, including gm.]

By their nature, all these algorithms can run for as long
