Discovering Documentation for Java Container Classes

Johannes Henkel, Christoph Reichenbach, and Amer Diwan
Department of Computer Science, University of Colorado at Boulder

December 28, 2007

Abstract

Modern programs make extensive use of reusable software libraries. For example, we found that 17% to 30% of the classes in a number of large Java applications use the container classes from the java.util package. Given this extensive code reuse in Java programs, it is important for the reusable interfaces to have clear and unambiguous documentation. Unfortunately, most documentation is expressed in English, and therefore does not always satisfy these requirements. Worse yet, there is no way of checking that the documentation is consistent with the associated code. Formal specifications present an alternative which does not suffer from these problems; however, formal specifications are notoriously hard to write. To alleviate this difficulty, we have implemented a tool which automatically derives documentation in the form of formal specifications. Our tool probes Java classes by invoking them on dynamically generated tests and captures the information observed during their execution as algebraic axioms. While the tool is not complete or correct from a formal perspective, we demonstrate that it discovers many useful axioms when applied to container classes. These axioms then form an initial formal documentation of the class they describe.
1 Introduction

One of the hallmarks of modern programming languages is the support for reusable code. This support comes both in the form of language features (such as inheritance) and through extensive libraries for commonly used functionality and data structures (such as hash tables). Language support allows users of the language to create their own reusable libraries and to distribute these widely. Figure 1 gives the number of classes in several Java applications that use the reusable container classes from the java.util package.
  Application              # classes   # using containers
  Jedit 4.1                    644        123 (19.1%)
  Xalan J 2.5.2              2,395        398 (16.6%)
  Jakarta Velocity 1.1b1     2,610        780 (29.9%)
  Apache Tomcat 5.0.19       3,959      1,084 (27.4%)
  Eclipse 3.0-M7            21,774      4,757 (21.8%)
  JBoss 4.0.0DR2            32,820      7,888 (24.0%)

Figure 1: Use of containers based on Java container interfaces in typical Java programs.
We see that a significant fraction of classes in Java applications take advantage of these reusable containers. Since Java applications rely so heavily on libraries, it is crucial for the libraries to be documented in a clear and unambiguous fashion. Moreover, we would like the documentation to be at least somewhat machine checkable so that library writers and users can automatically detect when documentation and code start to diverge. Today, most documentation for reusable code is written in a natural language (e.g., English for the documentation of the Java standard libraries), which is not machine checkable. Moreover, natural language is sometimes unclear, ambiguous, or incomplete. In contrast, formal specifications can document code clearly and unambiguously, and static verification or dynamic testing can often diagnose discrepancies between implementation and specification, or inconsistencies in a formal specification. Unfortunately, writing and debugging formal specifications is difficult.

This paper describes a tool that automatically discovers documentation for some properties of Java classes and emits this documentation in a formal notation. The notation we use is algebraic specifications [1]. Algebraic specifications are particularly effective at describing the behavior of container classes, which are heavily reused (Figure 1). Our discovered documentation concisely describes how to use the methods in a class without being bogged down by implementation details.

To derive an algebraic specification from a Java class with a static analysis, we would need to model the full language semantics and then abstract statically from
complex, heap-centric manipulations of implementation data structures into a concise, expressive algebraic specification of the externally visible behavior. For Java and similar languages, such an analysis is hard, since the language semantics are complex and only informally specified; native calls, reflection and dynamic class loading can further complicate analysis. In contrast, a dynamic analysis can attack the problem of finding specifications from the opposite direction: instead of trying to abstract implementation details into intended high-level behavior, a dynamic analysis can probe classes from the outside, by observing how concrete instances behave in practice. Based on the results of this probing, the analysis can formulate hypotheses on the behavior of the given class, and attempt to refute those. A dynamic analysis can therefore offload all the complications of language semantics to pre-existing evaluation mechanisms. Furthermore, such a dynamic analysis can avoid directly dealing with reflection, native calls and dynamic class loading whenever these language features are not exposed in the API. Therefore, we opted for a dynamic discovery mechanism in our system.

Intuitively, our approach is as follows. We automatically generate test cases for the class under consideration and execute those test cases. Based on the outcome of the executions, we propose relations that we generalize into axioms. Since all axioms stem from the observed behavior of the class, we find that our axioms are almost always correct. Our approach is inspired by recent work on discovering likely specifications based on program runs [2, 3, 4, 5]. Our methodology complements this prior work in that it is the first to discover algebraic specifications. Prior work can be split into systems which detect Hoare-style specifications (such as Daikon [2]) and systems which find finite state automata (such as Ammons et al.'s system [3]). We find that, for the task of documenting container classes, algebraic specifications are superior to the above approaches (Section 6).
We evaluate our system by running it on several container classes, including some from the Java standard libraries. Our experiments reveal that our approach yields mostly correct documentation for these kinds of classes; i.e., axioms are rarely wrong. We find that our system sometimes misses useful axioms. In such cases, we have attempted to determine the underlying reasons for why our system missed the expected axioms. Further, we present mechanisms by which users can guide our system to find such missed axioms.

This paper improves and extends upon the presentation, ideas, and experimental results presented in our previous work [6]. Specifically,

• We describe three optimizations not listed in our previous work (Section 3.2.3). Of these, two optimizations are optional. We evaluate the effectiveness of both optional optimizations (Section 4.2.1).

• Our system can now properly describe exceptions (Section 3.3.4).

• We discuss how our system can be extended to support new kinds of axioms (Section 3.3.5).

• We evaluate the performance of the axiom validation subsystem (Section 5).

• We added two case studies: one in which we interpret a discovered specification (Section 4.4.2) and another in which we compare the axioms reported by our system to a specification we developed without aid from our tool (Section 4.4.3).

The remainder of the paper is organized as follows. Section 2 discusses and illustrates the algebraic notation that we use in the discovered documentation. Section 3 details our approach to dynamic documentation discovery. Section 4 evaluates our system. Section 5 discusses the strengths and weaknesses of our approach. Section 6 reviews related work, and Section 7 concludes.
   1  specification ObjectStackSpecification          -- spec. name
   2  class ObjectStack is                            -- sorts
   3    edu.colorado.cs.simpleadts.ObjectStack
   4  class Object is java.lang.Object
   5  method NewObjectStack is ...                    -- operations
   6  method push is ...
   7  method pop is ...
   8  define ObjectStack                              -- sorts defined here
   9                                                  -- algebraic axioms:
  10  forall s:ObjectStack forall o:Object
  11    pop(push(s, o).state).retval =gte o           (Axiom 1)
  12
  13  forall s:ObjectStack forall o:Object
  14    pop(push(s, o).state).state =gte s            (Axiom 2)
  15
  16  pop(ObjectStack().state)
  17    throws java.util.NoSuchElementException       (Axiom 3)

Figure 2: Example specification for an ObjectStack class.
2 Background

Our system discovers documentation for Java classes using the notation described in our prior work [7]. This notation is designed to be close to the Java programming language, to make it easier for Java programmers to understand the discovered documentation. To illustrate the documentation language, Figure 2 gives the documentation in our language for an ObjectStack class.

Our documentation notation is based on algebraic specifications and thus has two parts: an algebraic signature (e.g., lines 2-7 in Figure 2) and a set of axioms [8, 9] (e.g., lines 10-17 in Figure 2). The algebraic signature itself has two parts: sorts (e.g., lines 2-4 in Figure 2) and operations together with their signatures (e.g., lines 5-7 in Figure 2). Intuitively, sorts give the types of interest to the underlying term algebra, whereas operations are the concrete entities from which terms in this algebra are built. Sorts fall into two categories: sorts which are used by the specification and sorts which are defined by the specification; the latter must be listed explicitly (e.g., line 8 in Figure 2).
The axioms (e.g., lines 10-17 in Figure 2) equate terms in the algebra. Therein, the equality operator (=gte) (Section 3.4) specifies that the two operands are observationally equivalent (equivalent in terms of "general term equivalence", as defined in Section 3.4), while the exception operator (throws) specifies that the left-hand side, when executed, throws the exception given as the right-hand side (Section 3.3.4). More precisely, given a Java method named m defined in a class represented by sort cls, with n arguments arg1, ..., argn of sorts sort(arg1), ..., sort(argn), and with the return type represented by sort ret, we construct the signature of an algebraic operation m within an algebra cls as follows:

  m : cls × sort(arg1) × ... × sort(argn) → cls × ret     (1)
The receiver argument (of sort cls) to the left of the arrow characterizes the original state passed into m. The receiver type to the right of the arrow represents the state of the receiver after executing m. The right-hand side of the arrow also includes the return value (of sort ret, which can be the sort void). In case the operation corresponds to a Java constructor, the receiver argument to the left of the arrow is of sort void.
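For example, assuming the method shapes suggested by the axioms in Figure 2 (push takes one Object argument and returns no value; pop takes no arguments and returns an Object; the full signatures are elided in the figure), the ObjectStack operations would receive the signatures

  NewObjectStack : void → ObjectStack × void
  push : ObjectStack × Object → ObjectStack × void
  pop : ObjectStack → ObjectStack × Object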
Note that while our full specification language [7] can capture side effects to arguments, this paper assumes that invoking a method may only modify the observable state of the receiver and return a result (cf. Equation 1). We have not found this limitation to be a problem for axiomatizations of container classes.

Consider the first two axioms, in lines 10-14 in Figure 2. The .retval and .state qualifications select elements from the result tuple. .state retrieves the first element, which is the possibly modified receiver object (this). .retval retrieves the second element, which is the method's return value. Both axioms are universally quantified over all object stacks and all objects. In our language, universal quantification over a type excludes the special object null. This is important to ensure that receiver objects are non-trivial. Axioms which use a null parameter are therefore specified separately.¹

¹ In some cases, this means that axioms must be duplicated if a class behaves identically for null and non-null parameters. This minor inconvenience does not arise very frequently and could be addressed with a slight extension to the type system.
Axiom 1 (Figure 2) states that invoking pop after a push returns the object that was last pushed. Axiom 2 states that invoking pop on an ObjectStack right after invoking push reverts the stack back into its prior, pre-push state. These axioms concisely and clearly document the interface of the ObjectStack, except for error conditions. We express error conditions by relating terms to exceptions with the binary relation throws, as in Axiom 3 (lines 16-17 in Figure 2): this axiom specifies that any attempt to pop an object off an empty stack will raise a java.util.NoSuchElementException.
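Read as plain Java, and assuming the push/pop shapes sketched above for edu.colorado.cs.simpleadts.ObjectStack, the three axioms correspond to the following call sequences (a sketch, not the authors' test harness):

    import java.util.NoSuchElementException;

    // Axioms 1-3 replayed as concrete Java calls (assumed ObjectStack API).
    class AxiomDemo {
        static void demo(ObjectStack s, Object o) {
            s.push(o);                    // push(s, o).state
            assert s.pop() == o;          // Axiom 1: ...retval =gte o
            // Axiom 2: s is now observationally back in its pre-push state.
            try {
                new ObjectStack().pop();  // Axiom 3: pop on an empty stack
                assert false : "expected NoSuchElementException";
            } catch (NoSuchElementException expected) {
                // the exception named on the right-hand side of Axiom 3
            }
        }
    }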
3 Discovering Algebraic Specifications from Java Classes

We discover algebraic specifications automatically from Java classes. The discovered specifications use the notation illustrated in Section 2. A sound, static analysis for discovering algebraic specifications from implementations is impractical. For illustration, consider the axiom

  forall a:A  foo(a).state = bar(a).state

To discover this axiom, a static analysis must at least prove that, given any instance of A, foo and bar will generate states that are equivalent. This subproblem is undecidable, and to our knowledge, no conservative solution exists that generates useful results for real-world programs; note that foo and bar may call arbitrary methods, use Java reflection, etc.

Our dynamic program analysis approximates a solution for the equivalence problem by recording the observable behavior at runtime and validating its observations with tests. Continuing with the above example, instead of proving that foo and bar generate an equivalent state, we observe the equivalent state for a number of examples of A. Unless we find evidence to the contrary, we optimistically assume that the equivalence holds.
[Figure 3: Architectural overview of our tool for discovering algebraic specifications. The Term Generator (Section 3.2) produces terms for the Equation Generator (Section 3.3), which emits equations; the Axiom Generator (Section 3.5) abstracts these into axioms, and the Rewriting Engine (Section 3.6) reduces them to less redundant axioms. The Equation Generator and the Axiom Generator both use General Term Equivalence (Section 3.4).]
Therefore, by missing a counter-example, we may report an incorrect axiom. While it is unfortunate that we can only guarantee soundness for behavior that is covered by the tests we run, our prototype system performed well with all real container classes that we tried (Section 4).

Fig. 3 gives an overview of our approach. We start by generating terms (Sections 3.1 and 3.2). These terms consist of operation applications (which correspond to Java method and constructor invocations) and constants, such as integers and particular Java objects. We use these terms to generate relations (Section 3.3). For example, we may observe that two terms evaluate to the same value; we then encode this observation as an equation. To determine whether the values produced by two terms are equal (Section 3.4), we use observational equivalence [10]: we consider two terms to be the same if they behave the same with respect to their public methods. Thus, observational equivalence abstracts away from low-level implementation details and only considers the interface.
We then generalize the relations into axioms by replacing subterms with universally-quantified typed variables (Section 3.5). To determine (with some probability) whether these axioms hold, we check them by generating test cases. Finally, we rewrite our axioms to eliminate redundancies (Section 3.6).
3.1 Automatically Mapping Java Classes to Algebras

Due to the design of our specification language (Section 2), the mapping from Java classes to algebraic signatures is trivial: each Java type is a sort in the algebra and each Java method is an operation on the appropriate sorts. For example, our system determines the sorts and operations declared in lines 2-7 of Fig. 2 automatically from the ObjectStack class: the system queries the Java reflection API at runtime to find the methods that belong to the input class and the argument and return types of the methods.
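As a concrete illustration, the following is a minimal sketch (not the authors' code) of this reflective extraction; the textual rendering of the signatures in the style of Equation (1) is our own.

    import java.lang.reflect.Method;

    // Minimal sketch: derive operation signatures from a class via reflection.
    class SignatureMapper {
        static void printOperations(Class<?> cls) {
            for (Method m : cls.getMethods()) {          // includes inherited operations
                StringBuilder sig = new StringBuilder();
                sig.append(m.getName()).append(" : ").append(cls.getSimpleName());
                for (Class<?> p : m.getParameterTypes()) // sort(arg_1) x ... x sort(arg_n)
                    sig.append(" x ").append(p.getSimpleName());
                sig.append(" -> ").append(cls.getSimpleName())                // post-state
                   .append(" x ").append(m.getReturnType().getSimpleName()); // ret (may be void)
                System.out.println(sig);
            }
        }
    }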
3.2 Generating Terms

The term generator (often abbreviated as just "generator" in this section) emits an endless stream of terms (see Figure 3). Each term models a particular object state. For example, the following two terms both model the state of an empty object stack:

  NewObjectStack().state
  pop(push(NewObjectStack().state, Object@7).state).state

Here, Object@7 refers to a particular instance of the Object class. Terms serve the role of test cases in our discovery tool. Therefore, the term generator's goal is to produce terms that thoroughly explore the state space of the input class. Thus, the generator produces terms that end with .state and not with .retval; we generate terms ending in .retval in a separate phase (Section 3.3).

3.2.1 Growing Terms

To generate a term for a particular algebra, the term generator first constructs an operation application corresponding to a Java constructor invocation.
[Figure 4: Growing terms incrementally.
  a. NewObjectStack().state: executes fine.
  b. pop(NewObjectStack().state).state: throws exception.
  c. Backtracking step.
  d. push(NewObjectStack().state, ?).state: argument is missing.
  e. push(NewObjectStack().state, Object@7).state: executes fine.
  f. pop(push(NewObjectStack().state, Object@7).state).state: executes fine.]
Table 1: Default constant pools

  Type      Constant Pool
  Object    {o0, o1, o2}
  int       {0, 1, 3}
  long      {0, 1, 3}
  boolean   {true, false}
For example, to generate a term for the ObjectStack algebra, the term generator starts with NewObjectStack, as shown in Fig. 4a. The term generator then grows the term incrementally, by adding one operation application (corresponding to one Java method invocation) at a time. Whenever the addition of an operation application causes the term to generate an exception when executed, the term generator backtracks, because growing an exception-throwing term further does not normally reveal interesting behavior. In the example in Fig. 4, the term generator backtracks (step c), since the pop operation applied to an empty stack yields an exception (step b).

The generator needs to provide arguments for the operations used to grow the term. From a conceptual point of view, it could simply recurse to generate arguments of the appropriate type. However, for efficiency and quality of ultimate results, we identify two special cases that we handle differently: arguments of a primitive type (such as int) and arguments that are instances of immutable classes. In our system, we consider classes to be immutable if none of the operations they expose alter the observable behavior of their instances. This definition subsumes the notion of "classes without mutators", such as java.lang.Object, but includes further classes, such as classes which keep internal state which they never expose, or only expose by means opaque to our system (such as file I/O). For both of these cases we use values from a pool of constants, pre-defined for each type. Table 1 lists the default settings; the three objects for Object denote three arbitrary objects which are referentially distinct. Users indicate which types they want our system to consider as immutable. Furthermore, users can introduce additional
values that they want the system to use for testing and exploration. If this pool of constants is too small, we will not explore enough of the state space and may end up missing some axioms; also, we may fail to invalidate proposed axioms for which a counter-example exists. If the set is too large, the tool will become too inefficient for practical use.

For mutable classes (i.e., classes which are not immutable), we cannot safely replace subterms with a particular instance as a precomputed constant, since the potentially modified state of the instance would then be used in the next evaluation of the term. For example, if we execute the term push(ObjectStack@4, Object@42).state multiple times, we will compute a different observable state for the receiver each time.

3.2.2 Policy for Term Generation

As discussed above, the quality of the terms ultimately determines the quality of the axioms discovered by our system. Our current system generates all terms up to a user-defined size, but uses the default (no-arguments) object constructor exclusively. The simplest term for a given class T is typically a constructor application, e.g. NewT().state. However, if we consider an arbitrary subtype U of T, we note that any of U's constructor operations might be used as well; i.e., we could use NewU() in any place NewT() is allowed. To keep the exploration space tractable, the discovery tool does not consider terms of type U when it needs a term of type T. Users can bypass this limitation by manually supplying immutable objects of type U to be used for T, or by stating that type U should be used instead of type T in general. Thus, users can guide the term generator in generating its test cases for the discovery system.

3.2.3 Optimizations

Terms are redundant if they produce objects with the same state as other (smaller) terms. To make the exploration of large object state spaces practical, we employ the following optimizations to discard such redundant terms.
Optimization 1: Avoid side-effect free operation applications. Since the generator needs to explore the state space of instances of a class, operation applications that do not modify the state of the receiver need not be part of any terms. For example, the toString() method rarely affects the state of the receiver. Thus, the generator often skips the toString method when generating terms. Our system experiments to detect side-effect free operation applications: if applying the operation to an object does not change the state of the object then our system considers the operation application to be side-effect free.
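The following sketch shows one way this experiment can be phrased; Serializer.pickle is a hypothetical state-pickling helper in the spirit of the serialization used by Optimization 2 below, and the Method/args pair describes one concrete operation application.

    import java.lang.reflect.Method;
    import java.util.Arrays;

    // Sketch of the side-effect experiment: an operation application
    // looks side-effect free if the receiver's pickled state is the
    // same before and after applying it. Serializer.pickle is assumed.
    class SideEffectProbe {
        static boolean looksSideEffectFree(Object receiver, Method op, Object... args)
                throws Exception {
            byte[] before = Serializer.pickle(receiver);
            op.invoke(receiver, args);                 // apply the operation
            byte[] after = Serializer.pickle(receiver);
            return Arrays.equals(before, after);       // unchanged state => skip op
        }
    }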
Optimization 2: Only generate the minimal term for any generated object representation. When two different terms generate objects with the same internal representation, the generator keeps only the smaller of the two terms. Whenever the generator produces a term, it executes this term and attempts to store a serialization of the resulting state into a hash table. If the state already exists in the hash table, the generator knows that it has seen the state before and thus the new term is unnecessary. Since the generator produces terms in order of increasing size, this mechanism will always discard larger terms in favor of smaller terms. The equation generator (Section 3.3) also uses the hash table conflicts to cheaply identify equations.
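A minimal sketch of this hash-table mechanism, with Term and Serializer.pickle as assumed helper types:

    import java.util.Base64;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch of Optimization 2: keep only the first term reaching each
    // state. Terms arrive in order of increasing size, so the first
    // term stored for a given state is minimal.
    class StateCache {
        private final Map<String, Term> canonical = new HashMap<>();

        /** True if t's state was already reached by an earlier (smaller) term. */
        boolean isRedundant(Term t) throws Exception {
            String state = Base64.getEncoder().encodeToString(
                    Serializer.pickle(t.evaluate()));
            Term smaller = canonical.putIfAbsent(state, t);
            // A conflict both discards t and hands the pair (smaller, t)
            // to the equation generator as a candidate state equation
            // (Section 3.3.1).
            return smaller != null;
        }
    }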
3.3 Finding Relations

From terms, our system constructs relations. These relations formalize the hypotheses on the behavior of the Java class we are analyzing. The form of relations generated by our system therefore determines the form of the algebraic specifications that we can discover. Our current implementation generates equalities and throws clauses. The relations we currently support are: state equations (Section 3.3.1), observer equations (Section 3.3.2), difference equations (Section 3.3.3), and exception relations (Section 3.3.4). We generate relations in the order in which we present them here.
3.3.1 State Equations: Equality of Distinct Terms

State equations describe the equivalence of state terms. For example,

  pop(push(ObjectStack().state, Object@4).state).state =gte ObjectStack().state

is a state equation. These equations are useful in characterizing how operations affect the observable state of an object. We generate these equations whenever we find that two distinct terms are equivalent. We get some of these equations from "Optimization 2" in Section 3.2.3: whenever we try to insert an already-existing entry in the hash table, we have a potential equation. Since hashing does not use full observational equivalence, this method may not find all state equations. Thus, our system uses two mechanisms for finding state equations: a fast hash-based mechanism, and a less efficient general term equivalence-based mechanism.

3.3.2 Observer Equations: Equality of a Term to a Constant

These equations take the following form (where obs is an observer and c a constant):

  obs(term1, arg2, ..., argn).retval =gte c

Observer equations characterize the interactions between operations that modify and operations that observe. For example,

  size(push(ObjectStack().state, Object@4).state).retval =gte 1

is an observer equation. To generate observer equations we start with a term and apply an operation that returns a constant. To determine whether a term returns a constant, we execute the term using the Java Reflection API and record the evaluation result. We evaluate the term a second time to heuristically determine that the term consistently returns the same value. We then form an equation by equating the term to that value, which we assume to be constant. Note that constants might be object constants. This is where
the scheme for immutable objects described in Section 3.2 ("arguments of immutable classes") becomes important: during term generation, we use immutable objects as constants; often, those objects then become the return value of a term. For example,

  pop(push(ObjectStack().state, Object@1).state).retval

evaluates to the object constant Object@1.

3.3.3 Difference Equations: Constant Difference Between Terms

These equations take the following form (where obs is an observer, op an operation computing a value of a primitive type, and diff a constant of a primitive type):

  op(obs(term1).retval, diff).retval =gte obs(term2).retval

For example:

  IntAdd(size(ObjectStack().state).retval, 1).retval
    =gte size(push(ObjectStack().state, Object@3).state).retval

In this example, op is IntAdd (i.e., integer addition) and diff is 1. To generate such axioms, we generate two terms, apply an observer to both terms, and take their difference. In practice, we found that difference equations with a small value for diff are the most interesting ones. Therefore, we only generate such an equation if diff is less than or equal to a fixed threshold (set to 100 for our experiments). This technique filters out most spurious difference equations.

3.3.4 Exception Relations: Term Raises an Exception When Evaluated

Exception relations describe terms which throw exceptions, and relate those terms to the concrete exception thrown. We generate such relations by tracking exceptions observed by our term generator during the term generation process (Section 3.2.1). Exception relations are the exclusive source for axioms built from the relation throws.
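For illustration, a minimal sketch of how such a capture can look in Java; Term and ThrowsRelation are assumed helper types, and the on-demand design discussed in Section 3.3.5 is not shown. Reflective invocations wrap the real exception, so the cause is unwrapped first.

    // Sketch: turning an exception observed during term growth into a
    // throws relation. Term and ThrowsRelation are assumed types.
    class ExceptionObserver {
        static ThrowsRelation observe(Term t) {
            try {
                t.evaluate();
                return null;              // no exception: the term may keep growing
            } catch (Exception e) {
                Throwable cause = (e.getCause() != null) ? e.getCause() : e;
                return new ThrowsRelation(t, cause.getClass().getName());
            }
        }
    }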
3.3.5 Extensibility

Our system is easily extensible with new relation generators. We added an initial and fully operational version of the exception relation generator in less than an hour. This initial implementation of the exception relation generator first produced a list of relations for which the term generator mechanism had observed exceptional behavior, but we found that this design did not scale to the largest test cases we were considering. We therefore changed the design to produce throws clauses on demand (rather than ahead of time). The resulting relation generator occupied 121 lines of code, which included an experimental feature decoupling relation generation from validation to allow parallelization: the code generates exception-throwing terms in one thread and passes them to a second thread, which generates an appropriate relation and attempts to abstract an axiom from the relation.

For equations and throws clauses, programmers can add a new relation generator by implementing a class with two operations, reset and genRelation, which reset its internal generator state and generate the "next" relation, respectively; the interface is sketched below. Our framework implicitly provides this class with mechanisms for depth-bounded term generation. In practice, we have found that most implementation effort goes into the bookkeeping of implementing genRelation as an iterator, i.e., without precomputing a table of all possible relations. Apart from the exception generator, the difference equation generator uses the most complex relation generation mechanism, occupying 20 lines of Java code. Other kinds of relations (such as less-than) are not presently supported by our system.
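Sketched as a Java interface (Relation is an assumed type, and returning null for an exhausted stream is our choice, not the paper's):

    // The two-operation relation generator interface described above.
    interface RelationGenerator {
        void reset();            // reset the internal generator state
        Relation genRelation();  // produce the "next" relation, or null when exhausted
    }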
3.4 Term Equivalence

To determine whether or not two terms are the same, the specification discovery tool uses a variant of observational equivalence testing [10] that takes advantage of Java
reference semantics. Two values are observationally equivalent if they behave the same when we apply arbitrary operations to both of them, even if their internal representation differs. Kamin [11] gives a formal account of this notion, though the idea of operational equivalence goes back further (for example, Plotkin [12] applies the idea). For illustration, consider an ObjectStack implementation in Java that uses an array as the internal representation and does not null out popped entries. For this implementation, the terms ObjectStack().state and pop(push(ObjectStack().state, Object@7).state).state generate two objects with different representations: the second term generates a stack with a stale reference to Object@7 in the array representation. However, both terms produce objects that are observationally equivalent.

We extend Doong and Frankl's notion of observational equivalence by introducing the concept of consistency, which allows us to accommodate Java methods such as hashCode (Section 3.4.1), and we exploit reference equality for performance optimizations (Section 3.4.2). Since observational equivalence (which we refer to as EQN) is expensive, we identify three special cases that we can handle more efficiently: primitive types, immutable classes, and representation equivalence. Algorithm 1 describes how we dispatch between the four possibilities. In Algorithm 1, the parameters for GTE-DISPATCH are terms, not the values computed by the terms. Thus, our comparison mechanisms, PRIMITIVE-EQUALS (Section 3.4.1), REFEQ-APPLIES (Section 3.4.2), REP-EQUALS (Section 3.4.3), and EQN (Section 3.4.4), can evaluate the terms more than once if they need more than a single sample of what the terms evaluate to. In Algorithm 1 and the following, =ref checks reference equality and =val checks value equality. We use EVAL(t) to denote the value computed by a term.
Algorithm 1 General term equivalence dispatch (computes the =gte relation)

  GTE-DISPATCH(t1, t2) begin
    Require: t1, t2 are terms of the same type T
    if T is a primitive type then
      return PRIMITIVE-EQUALS(t1, t2)
    else if REFEQ-APPLIES(t1, t2) then
      return EVAL(t1) =ref EVAL(t2)
    else if REP-EQUALS(t1, t2) then
      return true
    else
      return EQN(t1, t2)
    end if
  end

GTE-DISPATCH is accurate whenever it exposes non-equivalence. It may be inaccurate when it claims to have found equivalence. We now describe the algorithms for the four cases in more detail.

3.4.1 Equivalence for Primitive Types

Assume that the ObjectStack has a size method. We can confirm the equivalence

  size(push(ObjectStack().state, Object().state).state).retval =gte 1

by observing that the evaluation of the corresponding Java invocation sequences yields 1. Let us now consider hashCode(ObjectStack().state).retval. Since the hashCode function computes a different value for each ObjectStack instance, this term evaluates to a different value each time, except for hash collisions. We say that this term is not consistent; a consistent term produces the same result each time it is executed (this property is also referred to as referential transparency in the literature). To approximate consistency, Algorithm 2 evaluates each term twice. If both evaluations yield the same result, we optimistically consider the term consistent. If neither term is consistent, we consider them equal.
If exactly one term is consistent, we consider them to be non-equal. Finally, if both terms are consistent, we compare their evaluation results by using value equality.

Algorithm 2 Equivalence for primitive types

  PRIMITIVE-EQUALS(t1, t2) begin
    Require: t1, t2 are terms computing values of the same primitive type
    result1a ← EVAL(t1); result1b ← EVAL(t1)
    result2a ← EVAL(t2); result2b ← EVAL(t2)
    consistent1 ← (result1a =val result1b)
    consistent2 ← (result2a =val result2b)
    if not consistent1 and not consistent2 then
      return true
    end if
    if not consistent1 or not consistent2 then
      return false
    end if
    return result1a =val result2a
  end

In practice, the most prominent situation in which inconsistent terms arise is the default implementation of the hashCode method, as exemplified above, though our algorithm would also consider random number generators initialized by the system clock to be equal. Our approach of considering all inconsistent terms of the same sort to be equal is a heuristic which captures the above situations. In practice, we have not encountered a situation in which this heuristic was responsible for an incorrect axiom, though such situations are conceivable: for example, when applied to a factory object for fresh names, as used in some interpreters and proof assistants, our approach would be incapable of distinguishing two factory objects which produce mutually disjoint sets of names.

3.4.2 Comparing the References Computed by Terms

Since our algorithm currently handles only side effects to instance variables of receivers, and we use instances of immutable classes from a fixed set (Section 3.2), there are many situations where we can use reference equality rather than resorting to the more expensive observational equivalence algorithm (Section 3.4.4).
Algorithm 3 Checking for reference equality

  REFEQ-APPLIES(t1, t2) begin
    Require: t1, t2 are terms of the same reference type
    return EVAL(t1) =ref EVAL(t1) and EVAL(t2) =ref EVAL(t2)
  end
Thus, we use a heuristic similar to that for primitive types (Algorithm 3): we evaluate each term twice and check whether each term evaluates to the same reference in both evaluations. To gain more confidence we could evaluate each term even more often, but we have not yet encountered a need for this.
If they do, then we can simply compare the references that they return (part of Algorithm 1). If they do not, then we use the observational equivalence procedure. For example, for the terms pop(push(ObjStack().state, Object@123).state).retval and Object@123, REFEQ-APPLIES returns true, while it returns false for pop(push(ObjStack().state, Object@123).state).state and ObjStack().state. Java semantics guarantee that each constructor invocation generates a fresh reference.

3.4.3 Representation Equivalence

Algorithm 4 Representation equivalence

  REP-EQUALS(t1, t2)
    Require: t1, t2 are terms evaluating to instances of some class C
    if SERIALIZE(EVAL(t1)) =val SERIALIZE(EVAL(t2)) then
      return true
    else
      return false
    end if
Algorithm 4 shows pseudocode for a test of representation equivalence. Representation equivalence checks if both terms evaluate to objects with identical in-memory
representation by using SERIALIZE, which pickles an object into a binary representation, such as a byte array. Our implementation serializes all objects reachable from the starting object by traversing the object graph using reflection and expressing object references as unique identities encoding the number of steps since the start of the traversal. If the objects have the same inner state, then they are guaranteed to be almost observationally equivalent: the only ways to distinguish two objects with identical representations are their references and possibly their hashcodes (if the objects use the default Java hashCode implementation). However, default hashcodes are not consistent (Section 3.4.1) and thus cannot be used to distinguish objects in our system, and since the semantics of our specification language are value semantics, we can safely assume (for our specification language) that representation equivalence implies observational equivalence.
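A minimal sketch of such a SERIALIZE, assuming simple object graphs (arrays and superclass fields are ignored for brevity); this is not the authors' implementation:

    import java.lang.reflect.Field;
    import java.lang.reflect.Modifier;
    import java.util.*;

    // Walk the object graph reflectively and encode each reference by its
    // traversal order, so structurally identical graphs pickle identically.
    final class RepEquals {
        static String pickle(Object root) throws IllegalAccessException {
            Map<Object, Integer> ids = new IdentityHashMap<>();
            Deque<Object> work = new ArrayDeque<>();
            StringBuilder out = new StringBuilder();
            ids.put(root, 0);
            work.add(root);
            while (!work.isEmpty()) {
                Object o = work.remove();
                out.append(o.getClass().getName()).append('{');
                for (Field f : o.getClass().getDeclaredFields()) {
                    if (Modifier.isStatic(f.getModifiers())) continue;
                    f.setAccessible(true);
                    Object v = f.get(o);
                    if (v == null || f.getType().isPrimitive()) {
                        out.append(f.getName()).append('=').append(v).append(';');
                    } else {
                        Integer id = ids.get(v);
                        if (id == null) {       // first visit: next traversal id
                            id = ids.size();
                            ids.put(v, id);
                            work.add(v);
                        }
                        out.append(f.getName()).append("=#").append(id).append(';');
                    }
                }
                out.append('}');
            }
            return out.toString();
        }

        // REP-EQUALS from Algorithm 4: identical pickles imply same inner state.
        static boolean repEquals(Object a, Object b) throws IllegalAccessException {
            return pickle(a).equals(pickle(b));
        }
    }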
3.4.4 Observational Equivalence

Algorithm 5 Observational equivalence

  EQN(t1, t2)
    Require: t1, t2 are terms evaluating to instances of some class C
    repeat
      generate a test stub stb with an argument of type C
      for all observers ob applicable to evaluation results of stb do
        stubapp1 ← stb(t1); stubapp2 ← stb(t2)
        obsapp1 ← ob(stubapp1, ...).retval; obsapp2 ← ob(stubapp2, ...).retval
        if not GTE-DISPATCH(obsapp1, obsapp2) then
          return false
        end if
      end for
    until confident of outcome
    return true

Algorithm 5 shows pseudocode for our version of EQN. EQN approximates observational equivalence of two terms of class C as follows. It starts by generating term stubs that take an argument of type C and applies the stubs to t1 and t2. It then picks an observer and applies it to both terms; an observer is an operation op such that the type of op(...).retval is not void.
The algorithm then compares the outputs of the observers using GTE-DISPATCH (Algorithm 1). If GTE-DISPATCH returns false, then we know that t1 and t2 are not observationally equivalent; otherwise we become more confident in their equivalence. This algorithm could potentially run into an infinite recursion, which we prevent by not using EQN within GTE-DISPATCH for observer results.

For example, consider applying this procedure to terms from the ObjectStack algebra. An example of a stub is λx.push(x, Object@2).state and an example of an observer is pop. If t1 is push(ObjectStack().state, Object@3).state, the application of the stub and the observer yields

  pop(push(push(ObjectStack().state, Object@3).state, Object@2).state).retval

In this particular example, recursing to GTE-DISPATCH will return equality for any t2, as the stub dominates the observations gathered from the observer, i.e., this particular test stub obscures the distinguishing properties of t1. A difference between t1 and t2 would arise only after trying a different stub (such as λx.x, the identity function) or observer (not possible here, as pop is the sole observer of this algebra).
Test Stub and Observer Generation. We construct test stubs as terms with "holes", as in the above example. In general, test stubs and observers, when combined, have the structure

  λx.obs(C(x).state, v1, ..., vn).retval

where obs is an observer, C(x).state is a term derived from x, and the vi, i ∈ {1, ..., n}, are possible additional parameters to the observer. To ensure termination, we limit the term size of all vi to a user-specified test stub size limit, and the term size of C(x).state to the test stub size limit plus the size of x. Here and in the following, we define term size as the number of variables, constants and operation applications occurring within a given term, disregarding the selectors state and retval. For example, the size of the term push(IntStack().state, 4).state is 3.
An equation t1 = t2 passes the EQN test iff we observe t1 and t2 to behave identically (as discussed above) for all test stubs we can generate up to the user-specified test stub size.
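For illustration, here is one such probe hand-coded in Java for the assumed ObjectStack API: the stub λx.pop(x).state followed by the observer pop, which (as the next paragraph notes) exposes the second-most-recent stack entry and is therefore a useful combination, unlike the push-then-pop stub discussed above. Treating a raised exception as the observation is our simplification.

    import java.util.Objects;

    // One stub+observer probe from the EQN test (assumed ObjectStack API).
    class StubProbe {
        static Object observe(ObjectStack x) {
            try {
                x.pop();             // stub λx.pop(x).state: discard the top element
                return x.pop();      // observer pop: report the element beneath it
            } catch (RuntimeException e) {
                return e.getClass(); // the exception itself is an observation
            }
        }

        // One piece of counter-evidence for t1 =gte t2. EQN re-evaluates the
        // underlying terms for every probe, since probing mutates state.
        static boolean distinguishes(ObjectStack v1, ObjectStack v2) {
            return !Objects.equals(observe(v1), observe(v2));
        }
    }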
Optimization: Test Stub Reuse. During observational equivalence testing, our system generates test stubs to determine whether terms produce different results upon evaluation. As soon as we have found a test stub which invalidates the assumption that two given terms are equal, we can abort the test: we know for certain that the two terms are not equal. In practice, some combinations of observers and test stubs are more useful than others. For example, the push followed by pop we just considered would be useless as a test stub, whereas a single "λx.pop(x)" operation on a stack would allow the second-most-recent stack entry to be observed (via the observer "pop") and thereby constitute a useful test. Our system records test stubs which have been previously successful in showing that two terms are distinct. Our system retains these stubs in the order they were discovered, and applies them before invoking the test stub generation process described in Section 3.4.4.
3.5 Generating Axioms

Our axioms are 4-tuples (t1, R, t2, V), where t1 and t2 are terms, R is a binary relation symbol, and V is the set of all variables that occur freely in t1 and t2; we require that such variables are typed. A relation is simply an axiom with V = {}. We refer to relations with R = (=gte) as equations, and to relations with R = throws as throws clauses. The generation of axioms is an abstraction process that introduces variables into relations. For example, given the relation

  IntAdd(size(ObjectStack().state).retval, 1).retval
    =gte size(push(ObjectStack().state, Object@3).state).retval     (2)

our axiom generator can abstract ObjectStack().state to the variable s of type ObjectStack and Object@3 to i of type Object to discover the axiom

  ∀s : ObjectStack ∀i : Object
  IntAdd(size(s).retval, 1).retval =gte size(push(s, i).state).retval     (3)
Algorithm 6 Axiom generation

  generateAxiom(Algebra) begin
    (term1, term2) ← generate a relation in Algebra
    Subterms ← the set of subterms occurring in term1 or term2
    V ← a set of typed, universally-quantified variables x1, ..., xn,
        with one xi for each subtermi ∈ Subterms
    for xi ∈ V do
      replace each occurrence of subtermi with the variable xi in term1 and term2
      generate a set of test cases testset, where each test case is a set of
          generated terms {test1, ..., testn} such that testj can replace xj ∈ V
      for testcase ∈ testset do
        set all xj ∈ V to the corresponding testj ∈ testcase
        if GTE-DISPATCH(term1, term2) = false then
          undo the replacement of subtermi in term1 and term2
          stop executing test cases
        end if
      end for
    end for
    eliminate all xi from V which occur neither in term1 nor in term2
    return the axiom (term1, term2, V)
  end

Algorithm 6 describes the axiom generator, with optimizations left out for clarity.
To generate an axiom for a particular algebra, we first use any of the relation generators described in Section 3.3 to come up with a relation. We then compute the set of all subterms of term1 and term2. For example, given the terms push(ObjectStack().state, Object@4).state and ObjectStack().state, the set of subterms would be {push(ObjectStack().state, Object@4).state, ObjectStack().state, Object@4}.
We then initialize V as a set of variables, such that for each subterm there exists exactly one corresponding universally-quantified variable in V. The loop then checks for each subterm whether we can abstract all occurrences of this subterm to the
corresponding variable. We proceed as follows:

1. We replace all occurrences of the subterm with the matching variable.

2. We generate test cases, where each test case replaces all the variables in the terms with generated terms. For this purpose, we re-use our term generation mechanism (Section 3.2.1), configured to a user-specified maximum term size.

3. We then compare whether term1 and term2 are equivalent under all test cases. If so, we retain the modified term, and thus greedily explore the space of possible abstractions. Otherwise, we undo the replacement of the subterm.

If fewer than two test cases are available, we also undo the replacement. The reason for requiring two test cases to pass is that an abstraction that applies to only one situation is not useful. We believe that there is potential for improving this scheme by considering a confidence metric to express how well the tests cover the potential state space. In practice, there are three situations in which we can observe fewer than two test cases:

1. For very small user-specified maximum term sizes or very complex classes, our term generator may be unable to generate two distinct terms. In practice, users can increase the maximum term size for test generation to avoid this problem.

2. For classes whose objects appear immutable (i.e., whose objects we cannot distinguish by applying observers), our system will be unable to generate more than one test case. java.lang.Object is the most straightforward example of such a class. In practice, such classes are rare; where they arise, users can specify additional constant pools to avoid this problem.

3. Insufficiently large constant pools (Section 3.2.1) may also be responsible for generating too few test cases. While all of our default constant pools are sufficiently large, users might forget to specify a second object instance when constructing a constant pool.

At the end, we explicitly quantify all variables and return the axiom.
3.6 Axiom Redundancy Elimination by Axiom Rewriting

The axiom generator (Section 3.5) generates many redundant axioms. For example, for the ObjectStack algebra, our generator may generate both

  ∀x4 : ObjectStack, ∀i4, j4 : Object
  pop(pop(push(push(x4, i4).state, j4).state).state).state =gte x4     (4)

and

  ∀x5 : ObjectStack, ∀i5 : Object
  pop(push(x5, i5).state).state =gte x5     (5)
We eliminate redundant axioms using term rewriting [9]. As rewriting rules, we choose all equality-based axioms where (i) the right-hand side has a smaller term size than the left-hand side; and (ii) the free variables occurring in the right-hand side term are a subset of the free variables occurring in the left-hand side term. When using a rewrite rule on an axiom, we try to unify the left-hand side of the rewrite rule with terms in the axiom. If there is a match, we replace the term with the right-hand side of the rewrite rule. Condition (ii) guarantees that rewriting introduces no unbound variables, while condition (i) guarantees termination. We also allow throws clauses to be used for rewriting (with appropriate semantics), though we have not found this feature to be necessary for the set of axioms our system currently generates. Whenever we are about to add a new axiom, we check whether any of the existing rewrite rules can simplify the axiom. If we find a rewriting which makes the axiom redundant or trivial, we discard it. If a rewritten axiom yields a rewrite rule, then we use that rule to simplify all existing axioms, again discarding redundant and trivial results.
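The two admissibility conditions are mechanical; a minimal sketch, assuming a Term type offering size() and freeVars():

    // Sketch: accept an equation lhs = rhs as a rewrite rule only if it
    // shrinks terms and introduces no new free variables.
    class RewriteRules {
        static boolean isRewriteRule(Term lhs, Term rhs) {
            return rhs.size() < lhs.size()                      // condition (i): termination
                && lhs.freeVars().containsAll(rhs.freeVars());  // condition (ii): no unbound vars
        }
    }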
This process terminates since each rewriting application reduces the length of the terms of the axioms that it rewrites, which means that, in general, the addition of an axiom can only lead to a finite number of rewriting and elimination steps.

We now sketch how to rewrite the example Axioms (4) and (5) shown above. Suppose that Axiom (4) already exists and we are about to add Axiom (5). First, we try to rewrite Axiom (5) using Axiom (4) as a rewriting rule. Unfortunately, since the left-hand (longer) term of Axiom (4) does not unify with any subterm of Axiom (5), rewriting fails. We find that Axiom (5) does not already exist and it is not a trivial axiom, so we add it to the set of known axioms. Since Axiom (5) is a rewriting rule, we try to rewrite all existing axioms, namely Axiom (4). We find that the left side of Axiom (5) unifies with the following subterm of Axiom (4):

  pop(push(push(x4, i4).state, j4).state).state

with the unifier

  {x5 → push(x4, i4).state, i5 → j4}.

Therefore, we instantiate the right-hand side of Axiom (5) with the unifier we found and obtain

  push(x4, i4).state

which we use as a replacement for the subterm that we found in Axiom (4). Therefore, Axiom (4) rewrites to

  ∀x4 : ObjectStack, ∀i4 : Object
  pop(push(x4, i4).state).state =gte x4     (6)
which is equivalent to Axiom (5). Since the rewritten Axiom (4) is identical to Axiom (5) (modulo renaming of variables), we eliminate Axiom (4). In summary, we end up with Axiom (5) as the only axiom in the final set of axioms.
3.7 Discussion

Our system can discover algebraic specifications from Java classes. The system is dynamic, which means that the output is potentially unsound (i.e., may include incorrect
axioms) or incomplete (i.e., may miss some axioms). In practice, we can approximate soundness up to a certain degree (cf. Section 4.3) by increasing the maximum size for the generated test stubs, though this may be costly. The class of axioms our tool can generate is (i) the class of equations and throws clauses generated by our equation generators, and (ii) abstractions of these equations obtained by greedily replacing subterms with free variables. The main goal of our tool is to provide a good starting point in documenting code. Section 4 shows that our tool indeed finds many useful axioms.
4 Evaluation

This section first describes our experimental methodology (Section 4.1) and then evaluates our approach along the following dimensions: (i) Is our approach fast enough for practical use (Section 4.2)? (ii) Does our approach produce correct results (Section 4.3)? (iii) Does our approach produce complete documentation (Section 4.4)?
4.1 Methodology

We conducted our evaluations on a Dell PowerEdge 2900 Xeon 5060 "Woodcrest" 3.0 GHz dual-CPU dual-core workstation with 8 GB of RAM running Gentoo Linux and Sun JDK 1.5.0. Presently, our system only takes very limited advantage of the multi-core features of this hardware (Section 3.3.5).

For our experiments we configured our system as follows. We limited the size of terms so that the resulting relations have terms with a maximum size of five on either side. We also limited the size of the terms we use for validating axioms (plugged in for universally quantified variables) to five. Finally, we limited the size of the test stubs (used for observational equality testing, Section 3.4.4) to three.

Table 2 describes our benchmark programs. Column "# op" gives the number of operations in the class and "# obs" gives the number of operations that are observers (i.e., operations with a non-void return type). LinkedList, HashSet,
HashMap, and Hashtable are library classes from Sun’s Java Development Kit 1.5.0. IntegerStack and IntegerStack2 are two different implementations of an integer stack (with push and pop operations). Since our system disregards the static program structure and operates dynamically, it finds the same axioms for both implementations, except in one instance where the two are different (they raise different exceptions when popping off an empty stack).
Table 2: Java classes used in our evaluation.

  Java Class       Description                 Author       # op  # obs
  ObjectStack      object stack                Henkel         12      6
  IntegerStack     minimal integer stack       Henkel         13      6
  IntegerStack2    fast integer stack          Henkel         13      6
  PriorityQueue    priority queue              Reichenbach    15      9
  FullObjectStack  another Object stack        Henkel         16      9
  IntegerQueue     FIFO queue for integers     Henkel         16      9
  ObjectMapping    mapping between objects     Henkel         16     10
  ObjectQueue      FIFO queue for objects      Henkel         16      9
  HashSet          hash set                    Sun            27     21
  HashMap          hash map                    Sun            26     19
  Hashtable        hash table                  Sun            29     22
  ArrayList        list represented as array   Sun            33     26
  LinkedList       linked list                 Sun            46     37
4.2 Performance

We evaluate the performance of our system in two ways: its execution time performance (Section 4.2.1) and its ability to reject redundant or incorrect axioms (Section 4.2.2).

4.2.1 Execution time performance

Table 3 gives the execution time of our system. Columns "unoptimized" and "optimized" give the execution time (in seconds) of our system without and with optimizations, respectively. Column "unoptimized / side-effect opt" gives the benefit of avoiding side-effect free operations during term generation (Section 3.2.3). The numbers in this column are the ratio of the execution time without this optimization to the execution time with this optimization. Column "unoptimized / stub reuse opt" analogously gives the benefit of reusing test stubs (Section 3.4), and Column "unoptimized / both opt" gives the benefit of avoiding side-effect free operations and reusing test stubs.

From Column "optimized" we see that our system is quite fast, taking a few seconds for the smaller classes and a few hours for the most complex classes. From Column "unoptimized / side-effect opt" we see that avoiding side-effect free operations improves the performance for most benchmarks, yielding speedups of up to 1.09. This speedup comes about because avoiding side-effect free operations enables our system to skip between 8.3% and 52.5% of the terms during term generation. From Column "unoptimized / stub reuse opt" we see that reusing test stubs speeds up our system for 6 benchmarks and slows down the system for 4 benchmarks. Generally, the speedups due to this optimization are quite dramatic, yielding a maximum speedup of 2.44. Thus, even though the overhead incurred by this optimization sometimes degrades performance, the optimization is generally worthwhile, as the degradation tends to be small compared to the benefit. The benefit due to this optimization comes about because it significantly speeds up EQN (observational term equivalence), which accounts for a significant fraction of the overall execution time of our system. From Column "unoptimized / both opt" we see that the benefit of doing both optimizations is greater than doing just one. Thus both optimizations are worthwhile.

4.2.2 Axiom performance

Table 4 gives the number of axioms that our system discovers before and after rewriting. Comparing the "After rewriting" and "Before rewriting" columns we see that the rewriting is highly effective at eliminating redundant axioms, eliminating 88.3% to 99.6% of the discovered axioms. We see from Table 4 that LinkedList has the largest number of axioms: 139. While this is a large number of axioms, it is worth noting that the linked list has 46 operations (including inherited ones), and thus 139 axioms is not excessive.
Table 3: Execution time for axiom discovery with and without optimizations.

  Benchmark        unoptimized  optimized  unoptimized /    unoptimized /   unoptimized /
                   (seconds)    (seconds)  side-effect opt  stub reuse opt  both opt
  ObjectStack          2.60s       2.68s       1.00             0.97            0.97
  IntegerStack         1.09s       1.15s       0.98             0.98            0.95
  IntegerStack2        1.23s       1.41s       1.01             0.87            0.87
  PriorityQueue        2.09s       2.01s       0.97             1.04            1.04
  FullObjectStack      3.68s       3.71s       1.02             1.00            0.99
  IntegerQueue         6.58s       6.42s       1.03             1.00            1.03
  ObjectQueue          10.7s       10.7s       1.01             0.99            1.01
  ObjectMapping        1.12s       1.11s       0.99             1.00            1.01
  HashSet             2,290s      1,751s       1.05             1.24            1.31
  HashMap             7,362s      6,251s       1.05             1.10            1.18
  Hashtable           8,859s      7,706s       1.09             1.07            1.15
  ArrayList             462s        303s       1.07             1.44            1.53
  LinkedList            598s        218s       1.07             2.44            2.74
Table 4: Term rewriting.

                       # of axioms
  Benchmark        Before rewriting  After rewriting
  ObjectStack            121               14
  IntegerStack           114               13
  IntegerStack2          127               15
  PriorityQueue          251               30
  FullObjectStack        171               27
  IntegerQueue           376               29
  ObjectQueue            363               26
  ObjectMapping          249               23
  HashSet              8,658               67
  HashMap             12,829               40
  Hashtable           12,171               47
  ArrayList            4,107              107
  LinkedList           1,635              139
Table 5: Invalidation of proposed axioms.

                          axioms
  benchmark        proposed  validated         avg. # of tests
  ObjectStack          583      415 (71.2%)    22.2, σ = 29.2
  IntegerStack         599      380 (63.4%)    19.8, σ = 38.5
  IntegerStack2        667      417 (62.6%)    21.3, σ = 42.0
  PriorityQueue      1,223      734 (60.0%)    16.1, σ = 18.9
  FullObjectStack      833      517 (62.1%)    19.7, σ = 24.0
  IntegerQueue       2,313      992 (42.9%)    13.5, σ = 14.5
  ObjectQueue        1,956    1,099 (56.2%)    22.3, σ = 25.5
  ObjectMapping      1,261      802 (63.7%)    13.5, σ = 14.8
  HashSet           65,107   28,525 (43.8%)    30.3, σ = 39.9
  HashMap           97,012   42,283 (43.6%)    39.3, σ = 40.1
  Hashtable         88,441   36,751 (41.6%)    36.8, σ = 37.3
  ArrayList         27,218   10,620 (39.0%)    32.6, σ = 38.7
  LinkedList         9,561    3,846 (40.2%)    16.7, σ = 19.0
Table 5 gives the number of axioms that our system comes up with originally (Column "proposed") and the number of axioms that survive testing (Column "validated"). An axiom survives testing if we are unable to find a test for which the axiom does not hold. Column "avg. # of tests" gives the average number of tests (and the standard deviation) it took to reject an axiom. The "validated" column is not the same as Column "Before rewriting" in Table 4 because our system tries to validate many abstractions of an axiom before it attempts to rewrite it. From Table 5 we see that our system keeps 39.0% to 71.2% of the axioms that it tries; for the rest it is able to find a counterexample. Also, our system requires, on average, 13.5 to 39.3 tests to determine that an axiom is incorrect.
4.3 Soundness

Since our system uses a fully dynamic approach, it is not guaranteed to be sound. This means that our system may discover some incorrect axioms. Therefore, the documentation discovered by our system must be verified by a human developer in practice, either by informal inspection or by other mechanical means (such as a proof assistant).
Table 6: Correctness of discovered axioms (based on manual inspection).

                              Axioms:
  Class            Discovered  Rewritable  Incorrect  % Correct
  ObjectStack           14          1           1        92.9%
  IntegerStack          13          0           1        92.3%
  IntegerStack2         15          2           1        93.3%
  PriorityQueue         30          0           1        96.7%
  FullObjectStack       27          1           1        96.3%
  IntegerQueue          29          6           1        96.6%
  ObjectQueue           26          3           1        96.2%
  ObjectMapping         23          0           1        95.7%
  HashSet               67          5           1        98.5%
  HashMap               40          3           1        97.5%
  Hashtable             47          0           1        97.9%
  ArrayList            107         11           9        91.6%
  LinkedList           139         24           1        99.3%
However, even incorrect axioms are of some value, since they hold for many (but not all) runs of the class and may therefore point towards a conditional axiom.

Table 6 gives the number of axioms that our system discovers for our benchmark classes (Column “Discovered”) and the number of those axioms that are incorrect (Column “Incorrect”, determined by manual inspection). The table further lists the number of axioms that we believe to be rewritable (Column “Rewritable”) with a more sophisticated rewriting system or with built-in basic arithmetic: these axioms are correct but either not as general as possible or potentially subsumable by other detected axioms. The axioms were quite easy to read; it took at most half an hour to inspect the axiom set for one class.

We see that the vast majority (91.6% to 99.3%) of the axioms discovered by our system are correct. Incorrect axioms fall into two categories: an incorrect equals axiom and axioms which our system did not fully test. We give an example of the incorrect equals axiom below (from PriorityQueue):

forall x0 : edu.colorado.cs.PriorityQueue
forall x1 : java.lang.Object
  equals(x0, x1).retval == false                            (Axiom 4)
Our system discovered this axiom because our dynamic testing does not currently consider subtyping when choosing test values. As a result, we never attempted to pass a priority queue to equals and thus never encountered a situation in which equals would evaluate to true. We discovered this axiom (with appropriately varying types for x0) for all of our classes.

Here is an example of an incorrect axiom from ArrayList which arose because our system did not use large enough terms when testing the axioms:

forall x0 : java.util.ArrayList
forall x1 : int
  subList(x0, x1, 3).state throws java.lang.IndexOutOfBoundsException

For test terms of size 5 or less, the above axiom holds: subList always throws the exception. To invalidate the axiom, we need an ArrayList with at least four elements; the smallest term that constructs such a list has size 9 (one constructor plus four add operations, each of the latter taking one additional parameter). While this axiom is not universally correct, its presence suggests that there is a conditional axiom that we are missing. Users can thus use the invalid axiom to manually derive its correct, conditional generalization:

forall x0 : java.util.ArrayList
forall x1 : int
forall i : int
  i >= size(x0) ==> subList(x0, x1, i).state throws java.lang.IndexOutOfBoundsException

Our system provides two means by which users can help reduce such incorrectness when they encounter it. First, users can provide concrete objects to be used in the instance pool for a given class. Second, users can map classes to subclasses for purposes of term generation. In our tests, we found no need to take advantage of the
first technique, but we applied the second technique in several cases to map the type java.util.Collection to whichever class we were analyzing at a given point in time. Since many operations on Java container classes take Collection objects as parameters, this enabled us to find additional axioms.
4.4 Completeness

Our system is not guaranteed to be complete; it may fail to discover some axioms. Several aspects of our system can give rise to incompleteness, such as its dynamic nature or its inability to discover conditional axioms. The remainder of this section describes our attempts to evaluate the completeness of our discovery tool. Unfortunately, it is much harder to determine whether our system discovered all axioms than to determine whether the discovered axioms are correct. Thus, we use three techniques in our attempt to evaluate the completeness of the discovered documentation.

4.4.1 Evaluating completeness using coverage

One way in which the discovered axioms may be incomplete is if term generation fails to exercise some of the code of the class under consideration. Table 7 gives the basic block coverage attained by our term generator.2 Overall, we find that our terms yield high coverage. When our coverage is less than 100%, it is often because our terms do not adequately explore the argument space of operations (e.g., for LinkedList we did not generate a term that called removeAll with a collection containing a subset of the receiver’s elements, due to the user-specified term size limitation). Other reasons include dead code (e.g., some code in LinkedList.subList), corner cases (e.g., corner cases for initialCapacity in Hashtable), and non-default constructors. Low coverage may indicate missed or even incorrect axioms.

2 The numbers for IntegerStack and IntegerStack2 were incorrectly swapped in our earlier paper [6] due to a transcription error.
Table 7: Basic block coverage.

Class             Coverage
ObjectStack          100%
IntegerStack         100%
IntegerStack2       62.5%
PriorityQueue       63.6%
FullObjectStack      100%
IntegerQueue         100%
ObjectQueue          100%
ObjectMapping        100%
HashSet             86.3%
HashMap             69.1%
Hashtable           78.6%
ArrayList           87.1%
LinkedList          88.7%
Thus, if a user notes that the coverage is poor for a class, she can increase the aggressiveness of term generation (e.g., by increasing the maximum term size), provide values that the system can use to generate better terms, or map classes to subclasses for term generation (as discussed in Section 4.3). Note that code coverage is a sanity check only: while low coverage suggests that the generated specifications are incomplete, high coverage does not imply that they are complete.

4.4.2 Determining completeness using interpretation

We used our algebraic specification interpreter [7] to run the discovered specification for the ArrayList class. A client of the ArrayList class (we used a BibTeX parser) can use the interpretation of the specification of ArrayList as if it were the ArrayList class itself. If the interpretation cannot proceed, one or more axioms are missing. Of the 10 algebraic axioms needed to execute the BibTeX parser successfully, our discovery tool produced 2 exactly as needed; we added the remaining 8 axioms manually. As an example, the following two axioms specify how the first element of an ArrayList can be obtained by applying the get operation for index 0:
forall x0 : Object
  get(add(ArrayList().state, x0).state, 0).retval == x0            (Axiom 5)

forall l : ArrayList
forall o1 : Object
forall o2 : Object
  get(add(add(l, o1).state, o2).state, 0).retval
    == get(add(l, o1).state, 0).retval                             (Axiom 6)
While the system generated the first axiom, we added the second manually. The discovery tool could not detect the second axiom because it does not currently generate equations between two observers for non-primitive types.3 However, our system provides an extension mechanism for new equation generators (Section 3.3.5); to address this limitation, we could add a straightforward equation generator that produces equations between arbitrary observer terms (see the sketch at the end of this subsection).

3 Our earlier belief that this axiom could be discovered by increasing the term size [13] turned out to be unjustified.

Of the remaining manually added axioms, 5 describe the behavior of Iterator instances generated by ArrayList objects. For example, the following axiom states that an iterator created from an empty list does not have a next element:

hasNext(iterator(ArrayList().state).retval).retval == false

The discovery tool currently cannot find these 5 axioms because the state of the Iterator object is modeled as the return value of an operation (iterator()) of another class (ArrayList). None of our current equation generators covers this scenario. Again, a single equation generator, added to the system, could support this particular class of axioms. Our technical report gives more details of this case study [13].
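As an illustration of what such an extension might look like, here is a minimal Java sketch of an equation generator that pairs up arbitrary observer terms. The EquationGenerator interface, the Term and Equation types, and all names are hypothetical stand-ins of our own; the actual extension API of Section 3.3.5 may differ:

import java.util.ArrayList;
import java.util.List;

interface Term {} // a ground term over the class's operations

final class Equation {
    final Term lhs, rhs; // candidate equation lhs == rhs, checked dynamically later
    Equation(Term lhs, Term rhs) { this.lhs = lhs; this.rhs = rhs; }
}

interface EquationGenerator {
    List<Equation> propose(List<Term> observerTerms);
}

// Proposes an equation for every pair of observer terms; candidates that
// do not actually hold are weeded out by the dynamic testing phase, and
// surviving candidates are generalized into axioms.
final class ObserverEquationGenerator implements EquationGenerator {
    @Override
    public List<Equation> propose(List<Term> observerTerms) {
        List<Equation> candidates = new ArrayList<>();
        for (int i = 0; i < observerTerms.size(); i++) {
            for (int j = i + 1; j < observerTerms.size(); j++) {
                candidates.add(new Equation(observerTerms.get(i), observerTerms.get(j)));
            }
        }
        return candidates;
    }
}

Because all candidates are filtered by testing, such a generator can afford to overpropose; the cost is additional testing time, not additional incorrect axioms.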
4.4.3 Determining completeness by comparing to a manually written specification

In this case study, we compared a discovered specification for a priority queue class to a manual specification of the same class. Since we had significant experience with the manual specification, we expect it to be complete.

Of the 30 axioms we manually specified, our system discovered 8 exactly as originally defined. In addition, it discovered slightly more concrete versions of another 6 axioms. These axioms used the default constructor even though the class provides a more general constructor which can be parameterized with a comparator (an object representing a partial order). However, our system does not presently use non-default constructors, and therefore assumed the default behavior to be the only possible behavior. Our system failed to detect:

• 7 axioms because they were conditional, i.e., they held only if a certain precondition was met; thus, they had to be expressed in a language more expressive than the one in which our system states axioms. Whenever our system fails to discover conditional axioms, it usually discovers an overly concrete axiom which tells the user that a conditional axiom is needed. Our system can be configured to emit such concrete axioms in addition to the axioms we have reported on so far. For example, here is such a concrete axiom from HashMap:

forall x0 : Object
  get(put(HashMap().state, Object@0, x0).state, Object@1).retval == null    (Axiom 7)
• 6 axioms because it did not consider subtype instances when generating terms. As discussed above, the user can alleviate this situation by providing values for the precomputed sets, or by mapping supertypes to subtypes for which the system can generate values.
• 2 axioms because they were based on an auxiliary definition in the manual specification which had no counterpart in the priority queue implementation. Concretely, the manual specification contains a “hidden operation” which specifies how the comparator object can be extracted from a priority queue. This operation is “hidden” in the sense that it is not part of the PriorityQueue API and merely arose as a specification artifact. While inferring such auxiliary operations is an interesting challenge, it is beyond our present scope.

• 1 axiom because none of our relation generators emitted an equation that could have served as a template for it. The axiom is as follows:

forall p : edu.colorado.cs.PriorityQueue
forall q : edu.colorado.cs.PriorityQueue
forall o : java.lang.Object
  addAll(p, add(q, o).state).state == add(addAll(p, q).state, o).state

Our non-hashing equation generator cannot find the above axiom because hashing already equates the above terms, as per Optimization 2 (Section 3.2.3). Our current hash-based equation generator only records equations between arbitrary terms and the smallest term with the same state. For the above, the hash-based equation generator finds

addAll(PriorityQueue().state, add(PriorityQueue().state, 1).state).state
  == add(PriorityQueue().state, 1).state                                   (7)

and, separately,

add(addAll(PriorityQueue().state, PriorityQueue().state).state, 1).state
  == add(PriorityQueue().state, 1).state                                   (8)

The generator finds correct abstractions for both of the above equations, but since it does not perform unification on them, it cannot find the axiom we expected.
We can exploit the extensibility of our system to add a new relation generator that would find the above axiom.
4.5 Summary

While our approach is potentially unsound and incomplete, we find that it

• rarely produces incorrect axioms; moreover, the incorrect axioms are easy to pick out and often can be avoided with minor user intervention; and

• finds many, though not all, axioms. Some of the missing axioms can be found with minor user intervention; the remainder point to weaknesses of our current implementation, such as the lack of support for conditional axioms.
5 Discussion

The central properties of our approach are that

1. it is fully dynamic, though users can guide it if they wish (Section 3); and

2. it expresses its output, the documentation, using algebraic specifications.

This section discusses the advantages and disadvantages of these properties.
5.1 Properties of Fully Dynamic Documentation Discovery Mechanisms

The main disadvantage of a fully dynamic discovery mechanism is that it is not guaranteed to produce correct or complete results. At best, it can (i) use a large body of tests (terms, in our case) to decrease the chances of missing some results, and (ii) use extensive testing of its results to reduce the likelihood of incorrect results. Our system provides a number of mechanisms via which the user can manage the quality and quantity of term and test case generation.

There are two main advantages to dynamic discovery mechanisms.
First, dynamic mechanisms require a comparatively small “trusted base” for their discovery. Whereas static or combined analyses must assume the correctness of their supporting static analyses, a purely dynamic discovery system can rely largely on existing evaluation mechanisms; in the case of Java, we can use one (or even several) of the many existing virtual machines for dynamic analysis.

Second, dynamic analyses can depend only on observations and ignore implementation details. For example, our system discovered nearly identical documentation for IntegerStack and IntegerStack2 even though the two have different implementations.4 Since implementations may be arbitrarily complex and program equivalence is undecidable, static analyses cannot hope to be free of implementation bias in general.

4 The documentation differed only with respect to error handling: the two implementations used different exceptions to signal errors, and our system reported this fact.
5.2 Properties of Algebraic Specifications for Classes

We encountered three main problems with algebraic specifications:

• Algebraic specifications do not provide a good means for addressing the frame problem. To algebraically specify the behavior of an object with k fields, x1 through xk, and corresponding setter and getter methods, each of the k getters and setters must be specified explicitly as not affecting the results of the remaining getters and setters; i.e., each getter and setter method requires 2(k−1) properties in addition to its relevant properties. In other words, algebraic specifications cannot easily express the idea that “x does not change”. (The sketch at the end of this section gives a concrete illustration.)

• Algebraic specifications cannot easily express side effects on arguments. To address this problem, our algebraic specifications can be extended with additional selectors (on top of .state and .retval) to refer to arguments. Note, however, that such an extension would exacerbate the frame problem, as arguments would no longer be unaltered “by default”.
• Algebraic specifications cannot easily model interactions between classes. This was one of the most common reasons why our system missed axioms: specifically, we were unable to model iterators easily with algebraic specifications.

The main advantage of algebraic specifications is that even though they are simple, which makes them useful as documentation, they are highly expressive. In particular, prior work has found algebraic specifications to be well suited to describing container classes [1]. We have found that, using only equality and exception-throwing axioms, we could describe much of the interesting behavior of container classes. Other kinds of axioms (such as ones that use the less-than relation) can contribute in some instances, such as describing the effect of set union on set size. However, we can treat any relation a < b as an equation lessThan(a, b).retval == true. Since we found only limited use for such relations, we did not include them in our default relation generation setup.
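To illustrate the frame problem from the first item above, consider a hypothetical class Pair with two fields and operations getFirst, setFirst, getSecond, and setSecond (the class and its operations are invented for illustration). In the axiom notation used throughout this paper, each setter needs an explicit axiom for every field it does not touch:

forall p : Pair forall v : Object
  getSecond(setFirst(p, v).state).retval == getSecond(p).retval
forall p : Pair forall v : Object
  getFirst(setSecond(p, v).state).retval == getFirst(p).retval

Neither axiom says anything about what setFirst or setSecond actually does; each merely records that the other field is unaffected. For k fields, every setter needs such a “nothing else changed” property for each of the other k − 1 fields, which is the blow-up described above.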
6 Related Work

We now describe related work in specification languages, typestates, dynamic invariant detection, dynamic specification discovery, automatic programming, static analysis, and testing.
6.1 Specification Languages

We drew many ideas and inspirations from previous work on algebraic specifications for abstract data types [1]. Horebeek and Lewi [14] give a good introduction to algebraic specifications. Sannella and Tarlecki give an overview of, and motivation for, the theory behind algebraic specifications [15]. A book edited by Astesiano et al. collects reports of recent developments in the algebraic specification community [16]. Antoy describes how to systematically design algebraic specifications [17]; in particular, he describes techniques that help identify whether a specification is complete.
His observations could be used in our setting; however, they are limited to a particular class of algebras.

Prior work demonstrates that algebraic specifications are useful for a variety of tasks. Rugaber et al. study the adequacy of algebraic specifications for a re-engineering task [18]: they specified an existing system using algebraic specifications and were able to regenerate the system from the specifications using a code generator. Janicki and Sekerinski find that, for defining abstract data types, algebraic specifications are preferable to the trace assertion method [19, 20].
6.2 Typestates

Typestate specifications [21, 22] describe (sets of) finite state automata (FSAs). The states of any given FSA are abstractions of the possible states an object of the given class can be in, and its edges correspond to subsets of the class’s operations. Typestate specifications are high-level specifications in the sense that the states they describe (such as “initialized” or “empty”) are meaningful to clients even in the absence of implementation details. We consider typestate specifications to be somewhat orthogonal to algebraic specifications: while typestate specifications provide an elegant solution to the frame problem (Section 5.2), they do not provide a means for expressing state that must be modeled via recursive data types. Indeed, combining the two specification mechanisms might address their individual shortcomings.
6.3 Dynamic Invariant Detection

Recently, there has been much work on dynamic invariant detection [2, 3, 4, 5]. Dynamic invariant detection systems discover specifications by learning general properties of a program’s execution from a set of program runs.

Daikon [2, 23] discovers Hoare-style axiomatic specifications [24, 25]. Daikon is useful for understanding how something is implemented, but it also exposes the full
complexity of a given implementation. Daikon has been improved in many ways [26, 27, 28, 29, 30, 31] and has been used for various applications, including program evolution [2], refactoring [32], test suite quality evaluation [33], bug detection [4], and the generation of specifications that are then checked statically [34]. Recent versions of Daikon use a mechanism to efficiently describe an “inheritance” hierarchy of properties between different program points [29]; our system could be extended with a similar approach to support axioms shared within a type hierarchy.

Artzi et al. [30] describe a mechanism for detecting legal test inputs to programs, for the purpose of generating test cases automatically. Their FSA-based approach can model complex input sequences, but requires an existing test client.
6.4 Dynamic Specification Discovery

Guo et al. [31] discuss the inference of “abstract types”, a form of specification similar to the type qualifiers used by Lackwit [35]; this allows them to partition expressions (including variables and parameters) in a fine-grained, type-based manner by refining types according to how values are used. Such information could be used to reduce our tool’s search space as part of a static pre-processing pass.

Whaley et al. [5] describe how to discover specifications in the form of finite state machines that describe the order in which method calls can be made to a given object. Similarly, Ammons et al. extract nondeterministic finite automata (NFAs) that model temporal and data dependencies in APIs from C code [3]. These specifications are not nearly as expressive as algebraic specifications, since they cannot capture what values the methods return.

Our preliminary studies show that the current implementation of our tool does not scale as well as some of the systems mentioned above. However, we are unaware of any other dynamic tool that discovers high-level executable specifications [36] of class interfaces. Also, unlike prior work, our system interleaves automatic test generation and specification discovery; all previous systems require a test suite.
6.5 Automatic Programming

Automatic programming systems [37, 38, 39, 40, 41, 42] discover programs from examples or synthesize programs from specifications by deduction. These programs are analogous to our documentation in that our documentation, too, is a high-level description derived from examples. Algorithmic program debugging [43] is similar to automatic programming: it uses an inductive inference procedure to test side-effect-free and loop-free programs against input/output examples and then helps users interactively correct the bugs in the program. Unlike the above techniques, whose goals are to generate programs or find bugs, the goal of our system is to generate formal specifications (which could, of course, in turn be used to generate programs or find bugs).
6.6 Static Analysis

Program analyses generate output that describes the behavior of the program. For example, shape analyses [44] describe the shape of data structures and may be useful for debugging. Type inference systems such as Lackwit [35] generate types that describe the flow of values in a program. More recently, static analysis has been applied to the discovery of program specifications, both for low-level invariants [45] and for high-level specifications: Tillmann et al. describe Specification Meister [46], a static program analysis tool which infers high-level specifications by symbolic execution; their mechanism explores possible code paths, then merges and simplifies the observed properties. Li and Zhou take a very different approach in PR-Miner [47], using data mining techniques to determine valid sequences of calls; data mining relies on the presence of sufficiently many existing clients to determine how objects are normally used. Nanda et al. [48] describe a system for the static discovery of typestates (Section 6.2).

Our system is a dynamic black-box technique that does not need to look at the code to be effective. However, various static techniques could be used to guide our system
(for example, we have already experimented with mod-ref analyses to help identify immutable classes).
6.7 Testing

Woodward describes a methodology for mutation testing of algebraic specifications [49]. Mutation testing introduces small changes (“mutations”) into a specification, one at a time, to check the coverage of a test set. Woodward’s system includes a simple test generation method that uses the signatures of specifications.

Algebraic specifications have been used successfully to test implementations of abstract data types [50, 51, 52, 53]. One of the more recent systems, Daistish [52], allows for the algebraic testing of OO programs in the presence of side effects. In Daistish, the user defines a mapping between an algebraic specification and an implementation of the specification; the system then checks whether the axioms hold, given user-defined test vectors. Similarly, the system by Antoy and Hamlet [53] requires users to give explicit mappings between specification and implementation. Our system automatically generates both the mapping and the test suite.

Doong and Frankl [10] introduce the notion of observational equivalence and an algorithm that generates test cases from algebraic specifications. Their system semi-automatically checks implementations against generated test cases. Later work improved on Doong and Frankl’s test case generation mechanism by combining white-box and black-box techniques [54]. Our tool could potentially benefit from employing static analysis of the code when generating test cases (white-box testing).

In addition to the above, prior work on test-case generation [55, 56, 57] is also relevant to our work, particularly where it deals with term generation, as are methods for evaluating test suites and selecting tests [58, 59]. We do not use these techniques yet, but expect that they will be useful in improving the speed of our tool and the quality of the axioms.

Korat is a system for automated testing of Java programs [60]. Korat translates a
given method’s pre- and post-conditions into Java predicates, generates an exhaustive set of test cases within a finite domain using the pre-condition predicate and checks the correctness of the method by applying the post-condition predicate. Our approach for generating terms borrows ideas from Korat.
7 Conclusion

Java applications rely heavily on reusable container classes. In our study of a number of large Java applications, we found that up to 30% of an application’s classes used one or more of the containers in the java.util package. It is therefore important for container classes to be well documented.

To assist in this task, we described a documentation discovery tool that automatically discovers algebraic specifications for Java classes. Since algebraic specifications provide a clear, concise, and unambiguous description of the interface of container classes, they are well suited to documenting these classes. The specification discovery tool works by observing the run-time behavior of objects. Because our discovery tool is dynamic (i.e., based on observing run-time behavior), it is neither sound nor complete; however, by being dynamic it avoids having its observations depend on implementation details. Our experiments with a number of Java classes show that, for our benchmarks, the tool produces many useful axioms and that these axioms are usually correct and helpful for understanding and using container classes.
References

[1] J. V. Guttag and J. J. Horning, “The algebraic specification of abstract data types,” Acta Informatica, vol. 10, pp. 27–52, 1978.

[2] M. D. Ernst, J. Cockrell, W. G. Griswold, and D. Notkin, “Dynamically discovering likely program invariants to support program evolution,” IEEE Transactions on Software Engineering, vol. 27, no. 2, pp. 1–25, Feb. 2001.
[3] G. Ammons, R. Bodik, and J. R. Larus, “Mining specifications,” in Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2002, pp. 4–16.

[4] S. Hangal and M. S. Lam, “Tracking down software bugs using automatic anomaly detection,” in Proceedings of the International Conference on Software Engineering, May 2002, pp. 291–301.

[5] J. Whaley, M. C. Martin, and M. S. Lam, “Automatic extraction of object-oriented component interfaces,” in Proceedings of the International Symposium on Software Testing and Analysis, 2002.

[6] J. Henkel and A. Diwan, “Discovering algebraic specifications from Java classes,” in ECOOP 2003 – Object-Oriented Programming, 17th European Conference, L. Cardelli, Ed. Darmstadt: Springer, July 2003. [Online]. Available: citeseer.nj.nec.com/henkel03discovering.html

[7] ——, “A tool for writing and debugging algebraic specifications,” in International Conference on Software Engineering (ICSE), May 2004, pp. 449–458.

[8] J. A. Goguen, J. W. Thatcher, and E. G. Wagner, “An initial algebra approach to the specification, correctness and implementation of abstract data types,” in Current Trends in Programming Methodology, R. T. Yeh, Ed. Englewood Cliffs, N.J., 1978, vol. 4, pp. 80–149.

[9] J. C. Mitchell, Foundations for Programming Languages. MIT Press, 1996.

[10] R. Doong and P. G. Frankl, “The ASTOOT approach to testing object-oriented programs,” ACM Transactions on Software Engineering and Methodology, vol. 3, no. 2, Apr. 1994.

[11] S. Kamin, “Final data types and their specification,” ACM Transactions on Programming Languages and Systems, vol. 5, no. 1, pp. 97–121, 1983.

[12] G. D. Plotkin, “Call-by-name, call-by-value and the λ-calculus,” Theoretical Computer Science, vol. 1, no. 2, pp. 125–159, December 1975. [Online]. Available: http://dx.doi.org/10.1016/0304-3975(75)90017-1

[13] J. Henkel and A. Diwan, “Case study: Debugging a discovered specification for java.util.ArrayList by using algebraic interpretation,” University of Colorado at Boulder, Tech. Rep. CU-CS-970-04, 2004.

[14] I. V. Horebeek and J. Lewi, Algebraic Specifications in Software Engineering: An Introduction. Springer-Verlag, 1989.

[15] D. Sannella and A. Tarlecki, “Essential concepts of algebraic specification and program development,” Formal Aspects of Computing, vol. 9, pp. 229–269, 1997. [Online]. Available: ftp://ftp.dcs.ed.ac.uk/pub/dts/concepts.ps

[16] E. Astesiano, H.-J. Kreowski, and B. Krieg-Brückner, Eds., Algebraic Foundations of Systems Specification. Springer, 1999.

[17] S. Antoy, “Systematic design of algebraic specifications,” in Proceedings of the Fifth International Workshop on Software Specification and Design, Pittsburgh, Pennsylvania, 1989.

[18] S. Rugaber, T. Shikano, and R. E. K. Stirewalt, “Adequate reverse engineering,” in Proceedings of the 16th Annual International Conference on Automated Software Engineering, 2001, pp. 232–241.
[19] W. Bartussek and D. L. Parnas, “Using assertions about traces to write abstract specifications for software modules,” in Information Systems Methodology: Proceedings, 2nd Conference of the European Cooperation in Informatics, Venice, October 1978, ser. Lecture Notes in Computer Science, vol. 65. Springer-Verlag, 1978.

[20] R. Janicki and E. Sekerinski, “Foundations of the trace assertion method of module interface specification,” IEEE Transactions on Software Engineering, vol. 27, no. 7, July 2001.

[21] R. DeLine and M. Fähndrich, “Typestates for objects,” 2004. [Online]. Available: citeseer.ist.psu.edu/deline04typestates.html

[22] P. Lam, V. Kuncak, and M. Rinard, “Generalized typestate checking using set interfaces and pluggable analyses,” SIGPLAN Notices, vol. 39, no. 3, pp. 46–55, 2004.

[23] M. D. Ernst, J. H. Perkins, P. J. Guo, S. McCamant, C. Pacheco, M. S. Tschantz, and C. Xiao, “The Daikon system for dynamic detection of likely invariants,” Science of Computer Programming, 2006.

[24] C. A. R. Hoare, “An axiomatic basis for computer programming,” Communications of the ACM, vol. 12, no. 10, pp. 576–580, 1969.

[25] D. Gries, The Science of Programming, ser. Texts and Monographs in Computer Science. Springer-Verlag, 1981.

[26] M. D. Ernst, A. Czeisler, W. G. Griswold, and D. Notkin, “Quickly detecting relevant program invariants,” in Proceedings of the 22nd International Conference on Software Engineering, June 2000, pp. 449–458.

[27] M. D. Ernst, W. G. Griswold, Y. Kataoka, and D. Notkin, “Dynamically discovering program invariants involving collections,” University of Washington, TR UW-CSE-99-11-02, 2000, revised version of March 17, 2000.

[28] N. Dodoo, A. Donovan, L. Lin, and M. D. Ernst, “Selecting predicates for implications in program analysis,” March 2002, http://pag.lcs.mit.edu/~mernst/pubs/invariants-implications.pdf.

[29] J. H. Perkins and M. D. Ernst, “Efficient incremental algorithms for dynamic detection of likely invariants,” in Proceedings of the ACM SIGSOFT 12th Symposium on the Foundations of Software Engineering (FSE 2004), Newport Beach, CA, USA, November 2–4, 2004, pp. 23–32.

[30] S. Artzi, M. D. Ernst, A. Kieżun, C. Pacheco, and J. H. Perkins, “Finding the needles in the haystack: Generating legal test inputs for object-oriented programs,” in 1st Workshop on Model-Based Testing and Object-Oriented Systems (M-TOOS), Portland, OR, USA, October 23, 2006.

[31] P. J. Guo, J. H. Perkins, S. McCamant, and M. D. Ernst, “Dynamic inference of abstract types,” in ISSTA 2006, Proceedings of the 2006 International Symposium on Software Testing and Analysis, Portland, ME, USA, July 18–20, 2006, pp. 255–265.

[32] Y. Kataoka, M. D. Ernst, W. G. Griswold, and D. Notkin, “Automated support for program refactoring using invariants,” in International Conference on Software Maintenance, Florence, Italy, 2001.
[33] M. Harder, B. Morse, and M. D. Ernst, “Specification coverage as a measure of test suite quality,” September 25, 2001.

[34] J. W. Nimmer and M. D. Ernst, “Automatic generation of program specifications,” in Proceedings of the 2002 International Symposium on Software Testing and Analysis (ISSTA), Rome, July 2002.

[35] R. O’Callahan and D. Jackson, “Lackwit: A program understanding tool based on type inference,” in International Conference on Software Engineering, 1997, pp. 338–348. [Online]. Available: citeseer.nj.nec.com/329620.html

[36] J. Henkel, C. Reichenbach, and A. Diwan, “Developing and debugging algebraic specifications for Java classes,” University of Colorado, Tech. Rep. CU-CS-984-04, 2004.

[37] A. W. Biermann, “On the inference of Turing machines from sample computations,” Artificial Intelligence, vol. 3, pp. 181–198, 1972.

[38] A. W. Biermann and R. Krishnaswamy, “Constructing programs from example computations,” IEEE Transactions on Software Engineering, vol. 2, no. 3, pp. 141–153, Sept. 1976.

[39] A. W. Biermann, “The inference of regular Lisp programs from examples,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 8, pp. 585–600, Aug. 1978.

[40] D. Angluin, “Inference of reversible languages,” Journal of the ACM, vol. 29, no. 3, pp. 741–765, 1982.

[41] S. Hardy, “Synthesis of Lisp functions from examples,” in Proceedings of the Fourth International Joint Conference on Artificial Intelligence, 1975, pp. 240–245.

[42] P. D. Summers, “A methodology for Lisp program construction from examples,” Journal of the ACM, vol. 24, no. 1, pp. 161–175, 1977.

[43] E. Y. Shapiro, Algorithmic Program Debugging, ser. ACM Distinguished Dissertation 1982. MIT Press, 1982.

[44] M. Sagiv, T. Reps, and R. Wilhelm, “Parametric shape analysis via 3-valued logic,” ACM Transactions on Programming Languages and Systems, vol. 24, no. 3, pp. 217–298, May 2002.

[45] F. Logozzo, “Automatic inference of class invariants,” 2004. [Online]. Available: citeseer.ist.psu.edu/logozzo04automatic.html

[46] N. Tillmann, F. Chen, and W. Schulte, “Discovering likely method specifications,” in ICFEM, 2006, pp. 717–736.

[47] Z. Li and Y. Zhou, “PR-Miner: Automatically extracting implicit programming rules and detecting violations in large software code,” in ESEC/FSE-13: Proceedings of the 10th European Software Engineering Conference held jointly with the 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, vol. 30, no. 5. New York, NY, USA: ACM Press, September 2005, pp. 306–315. [Online]. Available: http://portal.acm.org/citation.cfm?id=1081755

[48] M. G. Nanda, C. Grothoff, and S. Chandra, “Deriving object typestates in the presence of inter-object references,” in OOPSLA ’05: Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications. New York, NY, USA: ACM Press, 2005, pp. 77–96.
[49] M. R. Woodward, “Errors in algebraic specifications and an experimental mutation testing tool,” IEEE Software Engineering Journal, vol. 8, no. 4, pp. 237–245, July 1993.

[50] J. Gannon, P. McMullin, and R. Hamlet, “Data-abstraction implementation, specification and testing,” ACM Transactions on Programming Languages and Systems, vol. 3, no. 3, pp. 211–223, 1981.

[51] S. Sankar, “Run-time consistency checking of algebraic specifications,” in Proceedings of the Symposium on Testing, Analysis, and Verification, Victoria, British Columbia, Canada, Sept. 1991.

[52] M. Hughes and D. Stotts, “Daistish: Systematic algebraic testing for OO programs in the presence of side-effects,” in Proceedings of the International Symposium on Software Testing and Verification, San Diego, California, 1996.

[53] S. Antoy and D. Hamlet, “Automatically checking an implementation against its formal specification,” IEEE Transactions on Software Engineering, vol. 26, no. 1, Jan. 2000.

[54] H. Y. Chen, T. H. Tse, F. T. Chan, and T. Y. Chen, “In black and white: An integrated approach to class-level testing of object-oriented programs,” ACM Transactions on Software Engineering and Methodology, vol. 7, no. 3, July 1998.

[55] R. Helm, I. M. Holland, and D. Gangopadhyay, “Contracts: Specifying behavioral compositions in object-oriented systems,” in Proceedings of the European Conference on Object-Oriented Programming on Object-Oriented Programming Systems, Languages, and Applications. ACM Press, 1990, pp. 169–180.

[56] U. Buy, A. Orso, and M. Pezze, “Automated testing of classes,” in Proceedings of the International Symposium on Software Testing and Analysis, Portland, Oregon, 2000.

[57] V. Martena, A. Orso, and M. Pezze, “Interclass testing of object oriented software,” in Proceedings of the IEEE International Conference on Engineering of Complex Computer Systems, 2002.

[58] H. Zhu, P. A. V. Hall, and J. H. R. May, “Software unit test coverage and adequacy,” ACM Computing Surveys, vol. 29, no. 4, pp. 366–427, 1997.

[59] T. L. Graves, M. J. Harrold, J.-M. Kim, A. Porter, and G. Rothermel, “An empirical study of regression test selection techniques,” ACM Transactions on Software Engineering and Methodology, vol. 10, no. 2, pp. 184–208, 2001.

[60] C. Boyapati, S. Khurshid, and D. Marinov, “Korat: Automated testing based on Java predicates,” in International Symposium on Software Testing and Analysis, Rome, Italy, July 2002.