KeY-C: A Tool for Verification of C Programs

Oleg Mürk, Daniel Larsson, and Reiner Hähnle

Chalmers University of Technology, Dept. of Computer Science and Engineering, S-412 96 Gothenburg, Sweden
Abstract. We present KeY-C, a tool for deductive verification of C programs. KeY-C allows one to prove partial correctness of C programs relative to pre- and postconditions. It is based on a version of KeY that supports Java Card. In this paper we use an example to give a glimpse of the syntax, semantics, and calculus of C Dynamic Logic (CDL), which were adapted from their Java Card counterparts. Currently, the tool is at an early stage of development.
1 Introduction
We present KeY-C, a variant of the software verification tool KeY [1] that supports a subset of C as its target language. KeY is an interactive theorem proving environment that allows one to prove properties of imperative/object-oriented sequential programs. The central concept is an axiomatization of the operational semantics of the target language in the form of a sequent calculus for dynamic logic, i.e., a program logic. The rules of the calculus that axiomatize program formulae define a symbolic execution engine for C. The system provides heuristics and proof strategies that automate large parts of proof construction: for example, first-order reasoning, arithmetic simplification, and symbolic execution of loop-free non-recursive programs are performed mostly automatically. The remaining user input typically consists of occasional existential quantifier instantiations. The main creative part is to specify a program, including loop (in)variants.

KeY was designed to ease interactive proof construction (see the screenshot in Fig. 1) and to make the learning curve less steep. For example, Java/C Dynamic Logic formulae contain executable source code, not a logic encoding or abstraction. The existing KeY system can handle Java Card and most of sequential Java, allowing verification of complex programs [1, Part IV]. Its calculus contains over 1000 rules, of which about half are language-independent Dynamic Logic (DL) rules. We are working on adding gradual support for a portable type-safe subset of C, axiomatized in C Dynamic Logic (CDL). As a side product of this work we expect to generalize the KeY architecture such that support for further programming languages can be easily added.

In Section 2 we give a taste of CDL and illustrate some of the problems that had to be solved during its design by working through a simple, but non-trivial, example that computes the sum of the integer elements in a linked list (see Fig. 2). Section 3 describes the current status and further work, and in Section 4 we conclude with related work and a summary.
Fig. 1. KeY graphical user interface.
2 C Dynamic Logic by Example
Dynamic Logic (DL) is based on First-Order Logic (FOL) extended with a type system: function and predicate arguments and function results are equipped with a type. In order to arrive at a reasonable calculus, the subtyping relationship ⊂ must form a lattice. In the formal semantics all elements of the semantic domain also receive a type. In order to represent different states during program execution, function and predicate symbols are split into rigid and non-rigid symbols. Rigid symbols behave the same as in FOL, while non-rigid symbols can have different values in different execution states.

CDL is a modal logic with a parametric modality [P] for every (compilable) C program P. The semantics is defined in terms of deterministic Kripke structures, where states (worlds) are determined by the values of non-rigid symbols and transitions are defined by the semantics of C programs. A formula is valid iff it is true in all possible states. The formula [P]φ is true in a state s iff φ holds in the final state reached when P is started in s, provided that P terminates at all. In other words, [P]φ asserts partial correctness of P w.r.t. postcondition φ.

Locations are special non-rigid functions that can have an arbitrary value in different states of the Kripke structure and are used to model modifiable memory locations of the program. That is, there exists a state for every combination of the values of the locations, while the value of other non-rigid symbols in some state may, for instance, depend on the values of some locations in this state.

The type system and the signature of CDL reflect the peculiarities of the C language. To represent integer rvalues we introduce an integer type int with the signature and semantics of the mathematical integers ℤ. Further, we need a type Void that is a supertype of all object types. Objects are memory locations that hold values and can be referenced by pointers (and consequently are lvalues).
Symbols representing pointer rvalues will have a type which is either Void or one of its subtypes. To represent pointer null rvalues we introduce the subtype of all object
struct Node {
    struct Node *next;
    int elem;
};

int sum(register struct Node *first) {
    register struct Node *ptr = first;
    int psum = 0;
    while (ptr != 0) {
        psum = psum + ptr->elem;
        ptr = ptr->next;
    }
    return psum;
}
Fig. 2. Example C program.
types Null. The semantics of this type consists of exactly one element, represented by the constant null. Every concrete object type is a subtype of Void and a supertype of Null. Every type T is equipped with a scalar object type T@ and a value location T@::value : T@ → T. In the example, these are int@ and $Node@, where $Node is a structure object type with rigid member accessor functions $Node::next : $Node → $Node@ and $Node::elem : $Node → int@. In the following we use the more compact notation i.value, n.next, etc.

Program variables are represented by location symbols. In the example, their types are ptr : $Node and psum : int@. The storage class register on variable ptr indicates that it cannot be referenced by a pointer and consequently can have an rvalue type as opposed to an lvalue type. We left the variable psum without this storage class to illustrate the challenge of having to prove non-aliasing of arbitrary object references of the same type, e.g., psum and ptr.elem. Otherwise, the type of psum would simply be int.

We create a proof obligation expressing that function sum actually computes the sum of the elements in a linked list. In general, proof obligations for (partial) correctness have the form pre ⇒ [F]post, where pre specifies assumptions about the initial state and post specifies the requirements for the final state if function body F terminates. In order to express the precondition that the function argument first refers to a linked list, we need to introduce two fresh rigid functions len : int and list : int → $Node. A possible precondition is now

  len ≥ 0 ∧ list(0) = first
  ∧ ∀ int i; ((0 ≤ i) ∧ (i < len) ⇒ list(i).next.value = list(i + 1))
  ∧ ∀ int i; ((0 ≤ i) ∧ (i < len) ⇒ list(i) ≠ null)
  ∧ list(len) = null

and the postcondition becomes psum.value = sumSpec(len), where sumSpec : int → int is a fresh rigid function and sumSpec(i) gives the sum of the first i elements of list.
Its properties are axiomatized in the precondition:

  sumSpec(0) = 0 ∧ ∀ int i; (i > 0 ⇒ sumSpec(i) = sumSpec(i − 1) + list(i − 1).elem.value) .

A sequent calculus is used for performing deduction. A sequent is of the form Γ ⊢ ∆, where Γ and ∆ are sets of formulae. The semantics of a sequent is the same as that of the formula ⋀Γ ⇒ ⋁∆. The CDL calculus builds upon standard FOL with equality and arithmetic. DL calculus rules that work on program modalities always modify the first active statement of the programs in the modalities.

The main principle of CDL is to reduce program modalities to so-called updates. An atomic update has the form U = {loc := val}, where loc is a location expression and val is its new value term. Semantically, the validity of Uφ in state s is defined as the validity of φ in state s′, which is state s with the values of locations modified according to the update U. There are operations for sequential and parallel composition of updates as well as for quantification, where the update is quantified by a free variable satisfying some condition. An update applied to a pure FOL formula can be automatically transformed into a pure FOL formula without an update.

Typically, most loop- and recursion-free sequences of program statements can be turned into updates fully automatically. For instance, symbolic execution of the first two lines of the function body in the sequent pre ⊢ [F]post results in pre ⊢ U[W]post, where W is the remaining program starting with the while-loop and U is the following update:

  { ptr := first; psum := int@::⟨lookup⟩(next); next := next + 1; psum.value := 0 } .

This update illustrates object allocation in CDL: psum is assigned int@::⟨lookup⟩(next) (the object lookup function ⟨lookup⟩ is rigid) and the non-rigid object counter next, pointing to the non-allocated object with the lowest index, is incremented. As explained above, the value location of the scalar object type int@ is accessed with the function value.

When a loop or a recursive function call is encountered, one must perform induction or use a loop invariant. In our case a suitable invariant I is:

  ∃ int i; (0 ≤ i ∧ i ≤ len ∧ ptr = list(i) ∧ psum.value = sumSpec(i)) .
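The specification and the invariant can be cross-checked informally by executing them on concrete lists. The following is a minimal sketch under stated assumptions, not part of KeY-C: the rigid function list is modeled by an array nodes with len + 1 entries (nodes[len] is the null pointer), and the names precondition_holds, sum_spec, invariant_holds, and sum_checked are hypothetical. Runtime assertions only test the properties on the inputs supplied; they do not replace the proof.

```c
#include <assert.h>

struct Node { struct Node *next; int elem; };

/* Precondition of sum, with list(i) modeled by nodes[i] for 0 <= i <= len
   (assumption: the caller supplies this array; first is nodes[0]). */
int precondition_holds(struct Node *nodes[], int len) {
    if (len < 0) return 0;
    for (int i = 0; i < len; i++)
        if (nodes[i] == 0 || nodes[i]->next != nodes[i + 1]) return 0;
    return nodes[len] == 0;
}

/* Executable counterpart of sumSpec: sum of the first i elements (i >= 0). */
int sum_spec(struct Node *nodes[], int i) {
    return i == 0 ? 0 : sum_spec(nodes, i - 1) + nodes[i - 1]->elem;
}

/* Invariant I: some i with 0 <= i <= len, ptr = list(i), psum = sumSpec(i). */
int invariant_holds(struct Node *nodes[], int len, struct Node *ptr, int psum) {
    for (int i = 0; i <= len; i++)
        if (ptr == nodes[i] && psum == sum_spec(nodes, i)) return 1;
    return 0;
}

/* The function sum from Fig. 2, instrumented with the specification. */
int sum_checked(struct Node *nodes[], int len) {
    assert(precondition_holds(nodes, len));
    struct Node *ptr = nodes[0];
    int psum = 0;
    assert(invariant_holds(nodes, len, ptr, psum));      /* holds initially */
    while (ptr != 0) {
        psum = psum + ptr->elem;
        ptr = ptr->next;
        assert(invariant_holds(nodes, len, ptr, psum));  /* preserved by the body */
    }
    assert(psum == sum_spec(nodes, len));                /* postcondition */
    return psum;
}
```

Calling sum_checked on a three-element list with elements 1, 3, 5 passes all assertions and returns 9; on the empty list (len = 0, nodes[0] null) it returns 0.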
To establish partial correctness of our loop using an invariant rule for imperative languages [2], one proves that the invariant holds initially, i.e., pre ⊢ UI, as well as the formulae

  pre ⊢ UV(I ⇒ [register int b = ptr != 0](b = 1 ⇒ [B]I)) ,
  pre ⊢ UV(I ⇒ [register int b = ptr != 0](b = 0 ⇒ post)) ,

where statement B is the loop body. V is a so-called anonymous parallel update { ptr := c1 || psum.value := c2 } that resets the variables modified within the loop body B to unknown values represented by fresh Skolem symbols. To ensure soundness one is generally required to reset all locations, as the loop body must preserve the invariant I for any initial state satisfying I. This requirement can be relaxed in a sound manner to those locations that are modifiable in the loop body, resulting in easier proof obligations.

The resulting program modalities are unrolled into updates over modality-free FOL formulae, which can be reduced to pure FOL formulae. The latter are proven using the rules of the typed FOL sequent calculus and the rules expressing the properties of the particular Kripke structure. For instance, we need axioms expressing the forest-like structure of the C heap, such as:

  ∀ $Node n; ∀ int i; (n.elem ≠ int@::⟨lookup⟩(i)),
  ∀ $Node n1, n2; (n1 = n2 ⇔ n1.elem = n2.elem),
  ∀ $Node n1, n2; (n1.elem ≠ n2.next) .
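Why such non-aliasing facts matter can be seen from a small experiment. The sketch below is illustrative only: sum_into and aliasing_demo are hypothetical names, and sum_into is a variant of sum from Fig. 2 in which the accumulator is an int object supplied by the caller rather than a freshly allocated local. When the accumulator is a distinct object the result is the expected sum; when it aliases the elem member of a list node, the loop overwrites list data and the result changes.

```c
#include <assert.h>

struct Node { struct Node *next; int elem; };

/* Variant of sum (Fig. 2): the accumulator object is passed in by the caller
   (hypothetical; in Fig. 2 the accumulator psum is a local variable). */
int sum_into(struct Node *first, int *acc) {
    struct Node *ptr = first;
    *acc = 0;
    while (ptr != 0) {
        *acc = *acc + ptr->elem;
        ptr = ptr->next;
    }
    return *acc;
}

void aliasing_demo(void) {
    struct Node n2 = { 0, 3 };
    struct Node n1 = { &n2, 2 };
    struct Node n0 = { &n1, 1 };
    int fresh;

    /* Accumulator distinct from every n.elem: the sum 1 + 2 + 3 is computed. */
    assert(sum_into(&n0, &fresh) == 6);

    /* Accumulator aliases n0.elem: writing *acc = 0 destroys the first element,
       so the result no longer equals the sum of the original list. */
    assert(sum_into(&n0, &n0.elem) != 6);
}
```

In Fig. 2 the aliased case cannot arise because psum is a newly allocated object; the first heap axiom above states this fact in a form the calculus can use.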
3 Current Status and Further Work
At this time, the type system, the signature, and the calculus outlined above are implemented in KeY-C and we are at the stage of debugging the calculus and improving its usability. In essence, we can work with a large subset of C variable declarations and expressions, and we support while and if statements. However, recall that we restrict ourselves to a type-safe, portable subset of C with no raw memory access.

Compared to the presentation used in our example, the actually implemented CDL calculus is somewhat more complicated. First, the C specification introduces the concept of undefined behavior. Further, there are unspecified behavior and values: the C language specification lists the possible options for interpretation, but does not tell which one is actually used. Finally, we need to model trap values. These are invalid values whose attempted access leads to undefined behavior. For instance, a pointer to a deleted object is a trap value, but there can also be integer trap values. Our approach requires one to prove, before reducing statements to updates, that conditions leading to undefined behavior cannot occur. Unspecified behavior and values are modeled by introducing fresh Skolem symbols.

The order of evaluation of some C expressions is unspecified. Our approach requires ensuring by external means (for example, by static analysis) that the result of an expression evaluation does not depend on the order.

C integer types cannot be represented in CDL by the type int, because the same integer value can have multiple bit representations (e.g., negative zero). In reality, we have for each C integer type (e.g., signed int) a corresponding logic type (e.g., SINT) with corresponding conversion functions (e.g., SINT::toInt and SINT::fromInt). Note that SINT::toInt might not be injective.

Creating a calculus for C pointer expressions poses many technical challenges. In C, objects can be deleted and C pointers may point to local variables that eventually go out of scope.
C allows arithmetic operations on pointers to the elements of the same array or to the element past the last element. Finally, C supports deep value assignments o1 = o2 of objects, where all member values of o2 are copied into o1. Such assignments can be modeled by just rewriting them into a sequence of scalar assignments, but we reduce them to an update.

Supporting full C, of course, requires a lot more work to model numerous minor and major features of the C language: for loops, const and volatile modifiers, string literals, typedefs, enumerations, const expressions, unions, bit-fields, varargs, and different forms of jump statements, just to name a few. Another conceptual extension is introducing modularity: translation units, extern and static variables, function calls, and function pointers. Luckily, the C module system can be viewed as a special case of the Java module system, so taking over the calculus from KeY for Java should be straightforward, although laborious. Calling function pointers can be implemented in the same way as polymorphic method calls in Java, which are fully realized in KeY.
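Two of the complications discussed in this section can be made concrete with a short sketch. It is illustrative only and does not show how KeY-C implements them: add_is_defined is a hypothetical name for the kind of side condition that rules out undefined behavior for a signed addition such as psum + ptr->elem, and sm8_to_int is a hypothetical conversion function for an assumed 8-bit sign-magnitude integer representation, playing the role of SINT::toInt.

```c
#include <assert.h>
#include <limits.h>

/* Side condition for the signed addition a + b: returns 1 iff the
   mathematical sum lies in [INT_MIN, INT_MAX], i.e., no overflow occurs.
   The comparisons are arranged so that they cannot overflow themselves. */
int add_is_defined(int a, int b) {
    if (b > 0) return a <= INT_MAX - b;
    if (b < 0) return a >= INT_MIN - b;
    return 1;
}

/* Conversion from an assumed 8-bit sign-magnitude representation
   (bit 7 = sign, bits 0..6 = magnitude) to a mathematical integer. */
int sm8_to_int(unsigned char rep) {
    int magnitude = rep & 0x7F;
    return (rep & 0x80) ? -magnitude : magnitude;
}

void representation_demo(void) {
    /* Overflow would be undefined behavior, so it must be excluded first. */
    assert(add_is_defined(1, 2));
    assert(!add_is_defined(INT_MAX, 1));
    assert(!add_is_defined(INT_MIN, -1));

    /* Two distinct bit patterns (positive and negative zero) denote the
       same integer, so the conversion is not injective. */
    assert(sm8_to_int(0x00) == 0);
    assert(sm8_to_int(0x80) == 0);
}
```

Because distinct representations can map to the same integer, equality of converted values does not imply equality of the underlying C values, which is one reason for keeping a separate logic type per C integer type.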
4 Related Work and Summary
There are several automatic and interactive verification tools for C, including approaches based on abstraction and model checking. In addition, tools such as SLAM (research.microsoft.com/slam) concentrate on bug finding. For lack of space we only mention the most relevant. In the Caduceus tool [3], correctness assertions over C programs are compiled into an intermediate programming language for which verification condition generators and FOL prover backends are available. Interleaving of symbolic execution and first-order reasoning is not possible, and interaction takes place on the level of intermediate code, not C source code. As part of the Verisoft project (www.verisoft.de), a Hoare calculus and formal semantics of the C subset C0 were developed on top of Isabelle/HOL [5]; however, verification of C programs is less automated than in KeY.

In this paper we briefly described the ongoing effort to develop KeY-C, the C target version of the verification system KeY. The implementation is done in Java and we are using the Cetus framework [4] for parsing and analysing C programs. As a side product of this work we expect to generalize the KeY architecture so that support for new programming languages can be added easily. The Java target version of KeY is available from www.key-project.org. KeY-C is available from the authors on request.

Acknowledgements

We benefited from many discussions with the members of the KeY project. The remarks of the reviewers led to several clarifications.
References

1. Bernhard Beckert, Reiner Hähnle, and Peter Schmitt, editors. Verification of Object-Oriented Software: The KeY Approach, volume 4334 of LNCS. Springer, 2006.
2. Bernhard Beckert, Steffen Schlager, and Peter H. Schmitt. An improved rule for while loops in deductive program verification. In Kung-Kiu Lau, editor, Proc. 7th Intl. Conf. on Formal Engg. Methods (ICFEM), Manchester, UK, volume 3785 of LNCS, pages 315–329. Springer, 2005.
3. Jean-Christophe Filliâtre and Claude Marché. Multi-prover verification of C programs. In Jim Davies, Wolfram Schulte, and Michael Barnett, editors, Formal Methods and Software Engg., 6th Intl. Conf. on Formal Engineering Methods, ICFEM, Seattle, USA, volume 3308 of LNCS, pages 15–29. Springer, 2004.
4. Sang Ik Lee, Troy A. Johnson, and Rudolf Eigenmann. Cetus - an extensible compiler infrastructure for source-to-source transformation. In Lawrence Rauchwerger, editor, Languages and Compilers for Parallel Computing, 16th Intl. Workshop, College Station, TX, USA, Revised Papers, volume 2958 of LNCS, pages 539–553, 2004.
5. Norbert Schirmer. Verification of Sequential Imperative Programs in Isabelle/HOL. PhD thesis, Technische Universität München, 2006.