Transforming the .NET Intermediate Language Using Path ... - CiteSeerX

5 downloads 43 Views 170KB Size Report
Path logic programming is a modest extension of Prolog for the specification of .... deed, this observation has prompted De Volder to propose the use of Prolog as an ...... Ganesh Sittampalam are supported by a research grant from. Microsoft ...
Transforming the .NET Intermediate Language Using Path Logic Programming Stephen Drape [email protected]

Oege de Moor [email protected]

Ganesh Sittampalam [email protected]

Oxford University Computing Laboratory Wolfson Building Parks Road Oxford OX1 3QD United Kingdom

ABSTRACT Path logic programming is a modest extension of Prolog for the specification of program transformations. We give an informal introduction to this extension, and we show how it can be used in coding standard compiler optimisations, and also a number of obfuscating transformations. The object language is the Microsoft .NET intermediate language (IL).

Categories and Subject Descriptors D.3.3 [Language Constructs and Features]: Patterns; D.3.2 [Language Classifications]: Constraint and logic languages

General Terms Languages.

Keywords Program Transformation, Meta Programming, Logic Programming, Program Analysis, Compiler Optimisations, Obfuscation.

1.

INTRODUCTION

Optimisers, obfuscators and refactoring tools all need to apply program transformations automatically. Furthermore, for each of these applications it is desirable that one can easily experiment with transformations, varying the applicability conditions, and also the strategy by which transformations are applied. This paper introduces a variation of logic programming, called path logic programming, for specifying program transformations in a declarative yet executable manner. A separate language of strategies is used for controlling the application order.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PPDP’02, October 6-8, 2002, Pittsburgh, Pennsylvania, USA. Copyright 2002 ACM 1-58113-528-9/02/0010 ...$5.00.

We illustrate the ideas by considering the .NET intermediate language (IL), which is a typed representation used by the backends of compilers for many different programming languages [15, 22]. IL is quite close to some high-level languages, in particular to C# [2, 17], and because of the ease by which one can convert from IL to C#, obfuscation of IL is important [8]. Our main examples are therefore drawn from the literature on obfuscation, but we also consider a few standard compiler optimisations. The structure of this paper is as follows. First we provide a brief introduction to IL, and a variant of IL that is useful when applying program transformations (or indeed when writing a decompiler). Next, we introduce the main ideas of path logic programming, as extensions of standard Prolog, and explain how we can use these ideas to transform IL programs. After these preliminaries, we present some concrete examples, first a few simple optimisations, and then some more complex obfuscating transformations.

2. .NET INTERMEDIATE LANGUAGE (IL) The core of Microsoft’s .NET platform is an intermediate language to which a variety of different source languages can be compiled. It is similar to Java bytecode, although rather more complex because it has been specifically designed to be a convenient compilation target for multiple source languages. Programs are distributed in this intermediate language and just-in-time compiled to native code on the target platform. In this paper we shall be concerned with a relatively small subset of this language; it is on-going work to expand the scope of our transformation system. IL is a stack-based language. The fragment we shall be considering has instructions to load literal values onto the stack (ldc), to create new arrays allocated from the heap (newarr ), and to load and store values between local variables or method arguments and the stack (ldloc, ldarg, stloc and starg). The standard arithmetic, boolean and comparison operations (which all have intuitive names) operate solely on stack values. Finally, there are instructions to do nothing, to branch conditionally and unconditionally, and to return from a method (nop, brfalse, brtrue, br and ret). All programs must be verifiable; for our purposes, this means that it should be possible to establish statically what type each item on the stack will be for each position in the in-

struction sequence. This also means that for any particular position, the stack must be the same height each time the flow of control reaches that position. The stack-based nature of IL makes it difficult to formulate even quite simple conditions on IL code. For example, the assignment y := (x + 2) − 1 might be represented by the IL sequence ldloc ldc add ldc sub stloc

x 2 1 y

As a result, a condition that recognised the above assignment would need to be quite long; this problem becomes much worse with more complicated expressions, especially if branches in control-flow occur while values are on the stack. Therefore, the first step we take is to convert from IL to expression IL (EIL). This language abstracts from the IL stack by replacing stack-based computations with expressions, introducing new local variables to hold values that would previously have been stored on the stack. It is the fact that we only deal with verifiable IL that makes this translation possible. The first stage simply introduces one extra local variable for each stack location and replaces each IL instruction with an appropriate assignment; thus the above would become something like: i1 i2 i3 i4 i5 y

:= := := := := :=

x 2 i1 + i2 1 i3 − i4 i5

It is left to the transformations described later to merge these assignments together to give the original assignment y := (x + 2) − 1. EIL is analogous to both the Jimple and Grimp languages from the SOOT framework [29, 30] – the initial translation produces code similar to the threeaddress code of Jimple, and assignment merging leaves us with proper expressions like those of Grimp. The above concrete syntax omits many significant details of EIL; for example, all expressions are typed and arithmetic operators have multiple versions with different overflow handling. We shall return to the structure of EIL in Section 4. The fragment of EIL that we consider here enables us to make many simplifying assumptions that would be invalid for the whole language. In particular, we ignore aliasing problems in this paper.

3.

PATH LOGIC PROGRAMMING

A simple optimisation that can be performed is atomic propagation. In our case, an atomic value is taken to be a constant, a local variable or a parameter that was passed by value to the method. The intention is that if a local variable is assigned an atomic value (and neither the variable nor the value are redefined) then a use of this variable can be replaced by the atomic value. In essence, atomic propagation is just a rewrite rule: S (X ) ⇒

S (V )

Here S (X ) stands for a statement that contains the variable X , and S (V ) is the same statement with X replaced by V . Naturally the validity of this rule requires that X is a variable that holds the value of V whenever statement S (X ) is reached. In the above formulation, X and V are metavariables that will be instantiated to appropriate fragments of the program we wish to transform. We only wish to apply the propagation if V is an atomic value, so that we do not introduce repeated computations. It is easy to implement rewriting in a logic programming language, and doing so makes it particularly easy to keep track of bindings to meta-variables, see e.g. [16, 28]. Indeed, this observation has prompted De Volder to propose the use of Prolog as an appropriate framework for programming transformations [13]. Here we go one step further, and extend Prolog with new primitives that help us to express the side conditions that are found in transformations such as atomic propagation. The Prolog program will be interpreted relative to the flow graph of the object program that is being transformed. The new primitive predicate all Q (N , M ) holds true if N and M are nodes in the flow graph, and all paths from N to M are of the form specified by the pattern Q. Furthermore, there should be at least one path that satisfies the pattern Q (we shall justify this slightly non-intuitive condition later). Such a pattern is a regular expression, whose alphabet consists of logic programming goals. To illustrate, let us return to atomic propagation. The predicate propagate(N , X , V ) holds true if this transformation is applicable at node N in the flow graph, with appropriate bindings for X and V . For example, when this predicate is evaluated relative to the flow graph in Figure 1, all the possible bindings are N := loop, X := y, V := a and N := exit, X := y, V := a The definition of propagate is in terms of the new primitive all: propagate(N , X , V ) :− all ( { }∗ ; { ′ set(X , V ), local (X ), atomic(V ) }; { not(′ def (X )), not(′ def (V )) }∗ ; { ′ use(X ) } ) (entry, N ). This definition says that all paths from program entry to node N should satisfy a particular pattern. A path is a

It may seem rather unnatural to represent paths as sequences of edges rather than sequences of nodes, given that patterns will usually examine the target node of an edge rather than the edge itself. However, using edges rather than nodes give us slightly more power – in particular, it allows us to specify that a path goes down the “then” or “else” branches of an if statement. Although thus far we have not made use of this extra power, we did not wish to rule out the possibility for the future.

3.1 Syntax and Semantics

Figure 1: An example flow graph sequence of edges in the flow graph. The path pattern is a regular expression, and in this example it consists of four components: • A path first starts with zero or more edges that we do not particularly care about: this is indicated by { }∗. As we shall see shortly, the interpretation of { } is a predicate that holds true always. • Next we encounter an edge whose target node is an assignment of the form X := V where X is a local variable and V is atomic, so it is worth attempting to do propagation. • Next we have zero or more edges to nodes that do not re-define X or the value of V . • Finally, we reach an edge pointing to a use of the variable X . This pattern should be satisfied for one particular binding of X and V , on all paths from entry to N . The fragments between curly brackets are ordinary logic programming goals, except for the use of the tick mark (′ ) in front of some predicates. Such predicates are properties of edges. For example, def (X , E ) is a predicate that takes two arguments: a variable X and an edge E , and it holds true when the edge points at a node where X is assigned. Similarly, use(X , E ) is true when the target of E is a node that is labelled with a statement that makes use of X . When we place a tick mark in front of a predicate inside a path pattern, the current edge is added as a final parameter when the predicate is called. We now return to our requirement that there should be at least one path between the nodes N and M for the predicate all Q (N , M ) to hold. Suppose we did not insist on this restriction, and we had some node N to which there did not exist a path from the entry node. Then propagate(N , X , V ) should succeed for any value of X and V , which would lead to a nonsensical situation when we tried to apply the transformation. We could of course specifically add a check to the definition of propagate to avoid this, but this would be required for many other predicates, and we have not found any where our requirement for all is a hindrance. There are also pragmatic reasons for this requirement – the implementation of our primitives demands it [12].

Figure 2 summarises the syntax of our extended version of Prolog. There are two new forms of predicate, namely all and exists. Each of these takes a path pattern and two terms. Both terms are meant to designate nodes, say A and B . The predicate all P (A, B ) holds true if all paths from A to B are of the form indicated by P , and there exists at least one path of that form. The predicate exists P (A, B ) simply requires that there exists one path of this form. A pattern is a regular expression whose alphabet is given by temporal goals – the operator ; represents sequential composition, + represents choice, ∗ is zero or more occurrences and ǫ an empty path. A temporal goal is a list of temporal predicates, enclosed in curly brackets. A temporal predicate is either an ordinary predicate (like atomic in the example we just examined), or a ticked predicate (like use). We can think of these patterns in the usual way as automata, where the edges are labelled with temporal goals. In turn, a temporal goal is interpreted as a property of an edge in the flow graph. The pattern {p0 , p1 , . . . , pk−1 } holds at edge e if each of its constituents holds at edge e. To check whether a ticked predicate holds at e, we simply add e as a parameter to the given predicate. Non-ticked predicates ignore e. We shall write g[e] for the interpretation of a temporal goal g at edge e. We can now be more precise about the meaning of exists: exists Q (S , T ) means ∃[e0 , e1 , . . . , en−1 ] : S → T : ∃[g0 , g1 , . . . , gn−1 ] ∈ Q : ∀i : 0 ≤ i < n : gi [ei ] In words, there exists a path in the flow graph from S to T , and a sequence of goals in the pattern (which leads from an initial state to a final state in the automaton Q) such that each goal holds at the corresponding edge. Universal path patterns are similarly defined, except that we require that at least one path satisfies the given pattern. To wit, all Q (S , T ) means exists Q (S , T ) ∧ ( ∀[e0 , e1 , . . . , en−1 ] : S → T : ∃[g0 , g1 , . . . , gn−1 ] ∈ Q : ∀i : 0 ≤ i < n : gi [ei ] ) In words, there exists a path between S and T of the desired form, and additionally all other paths between S and T are of this form too.

predicate

::=

pattern

::=

tpred

all pattern (term, term) | exists pattern (term, term) | predsym(term, . . . , term) | not(predicate, . . . , predicate) | | | |

{tpred , . . . , tpred } pattern ; pattern pattern + pattern pattern∗ ǫ

|

predicate ′ predsym(term, . . . , term)

::=

Figure 2: Syntax of path logic programming At this point, it is worth mentioning that our proposal to add temporal features to Prolog is by no means a new idea [25]. The application of such features to the specification of program transformations does however appear to be novel.

4.

TRANSFORMING EIL GRAPHS

4.1 Logic terms for EIL As we remarked earlier, the abstract syntax of EIL carries quite a lot of detailed information about expressions. This is reflected in the representation of these expressions as logic terms; thus, the integer literal 5 becomes the logic term expr type(ldc(int(true, b32), 5), int(true, b32)) The expr type constructor reflects the fact that all expressions are typed – its first parameter is the expression and the second the type. The type int(true, b32) is a 32-bit signed integer (the true would become false if we wanted an unsigned one). To construct a constant literal, the constructor ldc is used – it takes a type parameter, which is redundant but simplifies the processing of EIL in other parts of our transformation system, and the literal value. For a slightly more complicated example, the expression x + 5 (where x is a local variable) is represented by expr type(applyatom(add (false, true), expr type(localvar (sname(“x ”)), int(true, b32)), expr type(ldc(int(true, b32), 5), int(true, b32))), int(true, b32)) The term localvar (sname(“x ”)) refers to the local variable x – the seemingly redundant constructor sname reflects the fact that it is also possible to use a different constructor to refer to local variables by their position in the method’s declaration list, although this facility is not currently used. The constructor applyatom exists to simplify the relationship between IL and EIL – the term add (false, true) directly corresponds to the IL instruction add , which adds the top two items on the stack as signed values without overflow. Thus, the meaning of applyatom can be summarised as: “apply the IL instruction in the first parameter to the rest of the parameters, as if they were on the stack”.

Finally, it remains to explain how EIL instructions are defined. It is these that shall be used to label the edges and nodes of our flow graphs. An instruction is either an expression, a branch or a return statement, combined with a list of labels for that statement using the constructor instr label . For example, the following defines a conditional branch to the label target: instr label (“cond ” : nil , branch(cond (. . . ), “target”)) Note that we borrow the notation for lists from functional programming, writing X : Xs in lieu of [X |Xs]. If the current instruction is an expression, then exp enclosing an expression would be used in place of branch, and similarly return is used in the case of a return statement. Other EIL constructors shall be introduced as we encounter them.

4.2 Defining conditions on EIL terms The nodes of the graph are labelled with the logic term corresponding to the EIL instruction at that node. In addition, as described earlier, each edge is labelled with the term of the EIL instruction at the node that the edge points to; it is these labels that are used to solve the existential and universal queries (we anticipate that in future versions of the system, more complex analysis information will be stored at nodes and edges, and that the information will differ between nodes and edges). Our logic language provides primitives to access the relevant label given a node or an edge – @elabel (E , I ) holds if I is the instruction at edge E , and @vlabel (V , I ) holds if I is the instruction at node V (we use a convention of giving primitives names beginning with @). Thus, we can define the set predicate used in Section 3 as follows: set(X , A, E ) :− @elabel (E , instr label ( , exp(expr type(assign(X , A), T )))). Here and elsewhere, we adopt the Prolog convention that singleton variables are named by an identifier that starts with an underscore. It is then straightforward to define def in terms of set: def (X , E ) :−

set(X , A, E )

We also need to define use. This is based on the predicate occurs(R, X ), which checks whether X occurs in R (by the obvious recursive traversal). In defining use(X , E ), we want to distinguish uses of X from definitions of X , whilst still finding the uses of the variable x in expressions such as a[x ] := 5 and x := x + 1: use(X , E )

:− @elabel (E , S ), occurs(S , X ), not(def (X , E )).

use(X , E )

:− set( L, R, E ), occurs(R, X ).

4.3 Modifying the graph Although the logic language we have described makes it convenient to define side conditions for program transformations, it would be rather difficult to use this language to actually apply these transformations, since that would require the program flow graph to be represented as a logic

term. The approach we take is that a successful logic query should also bind its parameter to a list of symbolic “actions” which define a correct transformation on the flow graph. A high-level strategy language is responsible for directing in what order logic queries should be tried and for applying the resulting transformations. The strategy language is similar to those found in the literature on rewriting [5, 31], and we shall not discuss it further here. An action is just a term, which can be either of the form replace vertex (V , W ) or new local (T , N ). The former replaces the vertex V with the vertex W , while the latter introduces a new local variable named N of type T . Thus, the overall propagation rewrite can be defined as follows: propagate rewrite(replace vertex (N , M ) : nil ) :− propagate(N , X , V ), build (N , V , X , M ). The predicate build (N , V , X , M ) creates a new vertex M , by copying the old vertex N , replacing uses of X with V : build (N , V , X , M ) :− @vlabel (N , Old ), subst(V , X , Old , New ), listof (E , source(N , E ), Es), @new vertex (New , Es, M ). We have already discussed the primitive @vlabel . The predicate subst(X , V , Old , New ) constructs the term New from Old , replacing uses of X with V . As with use, it is defined so as not to apply this to definitions of X – if we are replacing x with 0 in x := x + 1 we want to end up with x := 0 + 1, not 0 := 0 + 1. New vertices are constructed by using @new vertex . This primitive takes a vertex label and a list of outgoing edges and binds the new vertex to its final parameter. In this case, we use the same list of edges as the old vertex, since all we wish to do is to replace the label. The predicate source(N , E ) is true if the vertex N is the source of the edge E , whilst the listof predicate is the standard Prolog predicate which takes three parameters: a term T , a predicate involving the free variables of T , and a third parameter which will be bound to a list of all instantiations of T that solve the predicate. Thus the overall effect of listof (E , source(N , E ), Es) is to bind Es to the outgoing edges from node N , as required.

5.

OPTIMISATIONS

The remainder of this paper consists of examples, and these are intended to evaluate the design sketched above. We shall first examine a number of typical compiler optimisations. In the present context, these transformations are used to clean up code that results either from the translation of IL to EIL, or from our obfuscations. In particular, we examine dead assignment elimination, a form of constant folding, and dead branch elimination. These were chosen because they are representative; this list is however not exhaustive, and it is essential that they are applied in conjunction with other transformations. Before we embark on detailed coding, therefore, we summarise one of these other transformations that is particularly important. As we have discussed in Section 2, the transformation of IL to EIL creates many extra local variables. For

example, the IL instructions: ldc stloc

5 x

are translated to something of the form: i x

:= :=

5 i

It can be easily seen that the above assignments can be combined to give: x

:=

5

Newly created local variables are generally only used once – the exception is when the original IL code has the control flow split at a program point where there are values on the stack. If a local variable is used only once, an assignment to it can be propagated regardless of whether the value being assigned is atomic or not (so long as the value is not changed in the meantime). The transformation to carry out this propagation follows a similar pattern to the atomic propagation defined in Section 3 and we shall not spell out the details.

5.1 Dead Assignment Elimination After propagation of atomic values and the removal of unique variable uses as described above, there will be places in the code where variables are assigned a value which is not used afterwards. Such assignments can be replaced by nop – a subsequent transformation can remove nops completely. It is more convenient to do this in two phases because there are many transformations which remove code, and in each we would otherwise have to be careful to preserve the labels of the removed instructions in case any jumps to those labels existed. Let us look at the conditions needed for this transformation: pure assignment(X , V , N ) :− exists ( { }∗ ; { ′ set(X , V ), pure(V ), local (X ) } ) (entry, N ). It should be noted that although we have used an exists query here, we did not really need to – in this case we are looking for a single node that satisfies certain conditions, not an entire path. However, it turns out to be convenient to express the query in this way because predicates such as use are already defined with respect to edges. The first part of the condition states we need an expression of the form X := V at node N . We also require that X is local to the method and that V is “pure”, i.e. it has no side effects (each of these conditions can be defined as standard Prolog predicates). For the second part of the condition, we require that after node N , X is no longer used (except at node N itself) until another definition of X , or the exit of the method body, is reached.

We first define a predicate unused other than at to capture the first half of this requirement. Note that it is permissible to use negation, since X will already be bound to a ground term. unused other than at(N , X , E ) :− isnode(N , E ). unused other than at(N , X , E ) :− not(use(X , E )). We can now define the path condition: unused from(X , N ) :− all ( { ′ unused other than at(N , X ) }∗ ; (ǫ + { ′ def (X ) }; { }∗) ) (N , exit). In other words, all paths from node N to the exit do not use X other than at node N , unless they first redefine X . For this transformation, it is not appropriate to use build to produce a new vertex, since the entire assignment needs to be replaced by a nop. Instead, the vertex is created manually: dead code(replace vertex (N , NopVert) : nil ) :− pure assignment(X , V , N ), unused from(X , N ), listof (E , source(N , E ), Es), @vlabel (N , instr label (L, ), @new vertex ( instr label ( L, exp(expr type(applyatom(nop), void ))), Es, NopVert).

5.2 Evaluation After the elimination of unique uses of a variable and atomic value propagation have been performed, we are often left with expressions involving only constants, which could be evaluated. For example: i x j y z

:= := := := :=

20 i 32 j x +y

would be transformed to: z

:=

20 + 32

We would like the right hand side of the assignment to be replaced by 52. So, we will need a predicate that tries to evaluate any arithmetic operations that have constant expressions – eval (I , J ) will try to evaluate I and bind the resulting integer to J . The base case states that the value of a constant is just that constant: eval (applyatom(ldc(int(true, b32), N )), N ).

Here is another clause of eval , for evaluating an addition: eval (applyatom(add (Ov , S ), L, R), V ) :− eval (L, V1 ), eval (R, V2 ), V is V1 + V2 . It should be noted that we are not inspecting the overflow and sign bits (Ov and S ) here, while the semantics dictate that we should. Future versions of our implementation will employ reflection to ensure that the semantics of compiletime evaluation is exactly the same as run-time evaluation. Using eval on an EIL expression leaves us with an integer (if it succeeds), which we then convert back to an EIL expression: value(V , expr type(applyatom(ldc(int(true, b32), V )), int(true, b32)). To apply evaluation to a node, we look for a non-atomic expression at that node (if we allowed atomic expressions to be evaluated then constant values would be repeatedly transformed into themselves!), and replace the expression by its value if eval succeeds: evaluate(replace vertex (N , M ) : nil ) :− exists ( { }∗ ; { ′ use(F ), F = expr type( , int(true, b32)), not(atomic(F )) } ) (entry, N ), eval (F , V ), value(V , CE ), build (N , CE , F , M ). As before, we use build to replace the original expression F with CE .

5.3 Dead Branch Elimination One of our obfuscations (see Section 6.2.2) adds conditional branches to the program. After evaluation of the conditions in such branches (or elsewhere), we may find we have a constant condition that is therefore redundant. In keeping with the specification of IL, we assume that “true” is defined to be a non-zero integer. First, we need to find a suitable conditional branch. We specify a predicate that will test whether a vertex has a conditional branch instruction: cond branch(Cond , Labels, E ) :− @elabel (E , instr label (Labels, branch(cond (Cond ), ))). To use this to find a true branch we look for constant conditions whose value is non-zero, and then replace the branch vertex with a nop pointing to the “true” branch. As with dead assignment elimination, this is simpler overall than just replacing the branch statement with the “true”

vertex. dead branch(replace vertex (BranchVert, NopVert) : nil ) :− exists( { }∗ ; { ′ cond branch( expr type( applyatom(ldc(int(true, b32), N )), int(true, b32)), Labels), not(N = 0) }) (entry, BranchVert), listof (Edge, source(BranchVert, Edge), TrueEdge : FalseEdge : nil ), @new vertex ( instr label ( Labels, exp(expr type(applyatom(nop), void ))), TrueEdge : nil , NopVert). We use listof as discussed earlier to obtain a list of the outgoing edges from BranchVert. Our graph representation guarantees that the edges will be ordered with the “true” branch first. For a false branch, we repeat the same definition, but require that N equals 0 in the condition, and replace TrueEdge by FalseEdge in the construction of NopVert.

• Any expressions of the form i := E (a definition of i) are transformed to j := f (E ). • Any uses of i (i.e. not a definition of i) are replaced by f −1 (j ).

6.1.1 An example Let us take f (i) = 2i and f −1 (j ) = j /2 . The program i := 1 s := 0 loop : s := s + i i := i + 1 brif (i < 15) loop should be transformed to j := 2 s := 0 loop : s := s + (j /2) j := 2 ∗ ((j /2) + 1) brif ((j /2) < 15) loop We could also define transformations to conduct algebraic simplification, which would turn the above into: j := 2 s := 0 loop : s := s + (j /2) j := j + 2 brif (j < 30) loop

6.1.2 Implementing the transformation

6.

OBFUSCATIONS

It is relatively easily to decompile IL code back to a highlevel language such as C#. Therefore, software distributors who wish to avoid their source code being revealed, for example to prevent tampering or to protect a secret algorithm, need to take steps to make this harder. One possibility is to obfuscate the IL code that they distribute. Although preventing decompilation completely is likely to be impossible, especially in the case of verifiable IL code, applying transformations that make the source code that results from decompilation difficult to understand might be an acceptable alternative. In this section, we show how path logic programming can be used to give formal, executable specifications to two representative examples from Collberg’s taxonomy of obfuscating transformations [8]: variable transformation and array splitting.

6.1 Variable Transformations The idea of variable transformation is to pick a variable i which is local to a method and to replace all occurrences of i in that method with a new variable j , which is related to i. For this, we need a function f which is bijective with domain Z (or some subset of Z if the potential values of i are known) and range Z (again, we could have some subset). We also will need f −1 – the inverse of f (which exists as f is bijective). To transform i, we need to perform two types of replacements:

The initial phase of the transformation is to find a suitable variable. All we require is that the variable is assigned to somewhere and that it is local. After we choose our variable (OldVar ), we generate a fresh variable name using @fresh name, which takes a type as a parameter so that the name generated can reflect this type. As well as producing an action that adds this new local variable to the method, the following introduce local predicate returns the old and new local variables – these are needed for the next phase, which is to actually transform the uses and definitions of the old variable. introduce local (OldVar , NewVar , new local (int(true, b32), NewVarName) : nil ) :− exists ( { }∗ ; { ′ set(OldVar , V ), OldVar = expr type(localvar ( ), int(true, b32)) } ) (entry, OldVarVert), @fresh name(int(true, b32), NewVarName), NewVar = expr type(localvar ( sname(NewVarName)), int(true, b32)).

Once the new local has been introduced, we can carry out the obfuscation by exhaustively replacing uses and definitions of the old variable as appropriate. We first specify predicates which build representations of the functions f and f −1 outlined in Section 6.1.1. The predicate use fn(A, B ) binds B to a representation of f −1 (A): use fn (A, expr type( applyatom(cdiv (true), A, expr type(applyatom( ldc(int(true, b32), 2)), int(true, b32))), int(true, b32)) ). Similarly, we can define assign fn(C , D) which binds D to a representation of f (C ). It is now simple to replace uses of OldVar : replace use(OldVar , NewVar , replace vertex (OldVert, NewVert) : nil ) :− exists ( { }∗ ; { ′ use(OldVar ) } ) (entry, OldVert), use fn(NewVar , NewUse), build (OldVert, NewUse, OldVar , NewVert). Similarly, we can replace assignments to OldVar .

6.2 Array Transformations

place the elements in two (or more) new arrays. To do this, it is necessary to define functions which determine where the elements of the original array are mapped to (i.e. which array and the position in that array). Let us look at a simple example. Suppose that we have an array A of size n and we want to split it into two new arrays B1 and B2 . We want B1 and B2 to have the same size (possibly differing by one element), so let B1 have size ((n + 1) div 2) and B2 have size (n div 2). The relationship between A, B1 and B2 is given by the following rule:



A[i] =

B1 [i div 2] B2 [(i − 1) div 2]

if i is even if i is odd

The program: int[] a := new int[20] int i := 0 loop : a[i] := i i := i + 1 brif (i < 20) loop should be transformed to: int[] b1 := new int[10] int[] b2 := new int[10] int i := 0 loop : if (i%2 == 0) b1[i/2] := i else b2[(i − 1)/2] := i i := i + 1 brif (i < 20) loop (The if . . . else . . . is not strictly part of EIL, but its implementation in terms of branches is obvious.)

If we have a method which uses one or more arrays, we can rearrange some or all of those arrays, for example by permuting the elements of one, or by merging two of the arrays into one, or by splitting one into two separate arrays. In essence, array transformations are just a type of variable transformation. The key point is that each access (use or definition) to an element of one of the original arrays can be translated into an access to an element of one of the transformed arrays. If one of the original arrays is used in its entirety (for example by being assigned to another array variable), this would need to be replaced with code to dynamically apply the transformation to the entire array, which could have a major impact on the runtime performance of the program. Therefore, we avoid applying array transformations in situations where this would be necessary. We consider that an array-typed local variable can have an array transformation applied to it if every path through the method reaches a particular initialisation of that variable to a newly created array (using the IL instruction newarr ), and that all occurrences of that array variable which can be reached from that initialisation are accesses to an element of the array, rather than a use or definition of the array variable itself.

To ensure that there are no index clashes with the new arrays, we require that f1 and f2 are injective. We should also make sure that the elements of A are fairly distributed between B1 and B2 . This means that c should partition [0..n) into (approximately) equal pieces.

6.2.1 Array splitting

6.2.3 Finding a suitable array

The obfuscation that we are going to specify is an array split. The idea of array splitting is to take an array and

Now, we show how to implement an array split using the transformation outlined in the example in Section 6.2.1. Let

6.2.2 Specifying the transformation In general, suppose we have an array A with size n. Then to define an array split of A into two other arrays, we need three functions c, f1 and f2 and two new arrays B1 and B2 of sizes m1 and m2 respectively (where m1 + m2 n). The types of the functions are as follows:

>

c : [0..n) f1 : [0..n) f2 : [0..n)

→ → →

{True, False} [0..m1 ) [0..m2 )

The relationship between A, B1 and B2 is given by the following rule:



A[i] =

B1 [f1 (i)] if c(i) B2 [f2 (i)] otherwise

us look at the conditions necessary for the transformation. First, we need to find a place where an array is initialised: array initialise(InitVert, OldArray, Type, OldSizeExpr ) :− exists ( { }∗ ; { ′ set(OldArray, expr type( applyatom( newarr (T ), expr type(OldSizeExpr , int(true, b32))), array(Type))), OldArray = expr type(localvar ( ), array(Type))), ctype from type spec(Type, T ) } ) (entry, InitVert). This condition states that at InitVert, we have an instruction of the form: OldArray := newarr (Type)[Size] where OldArray is a local variable of array type. The standard representation of types in IL that we have being using so far is known as a ctype; however the type parameter to newarr takes a slightly different form known as a type spec. We use the predicate ctype from type spec to compare the two and thus make sure that we are dealing with a well-typed array initialisation. The next step is to check that every path through the method reaches the initialisation, and after that point the array variable is not used except to access an array element. The predicate unindexed (OldArray, E ) holds if OldArray is used without an index at the node pointed to by edge E . This gives us the following condition: ok to transform(InitVert, OldArray) :− all ( { }∗ ; { ′ isnode(InitVert) }; { not(′ unindexed (OldArray)) }∗ ) (entry, exit).

6.2.4 Initialising the new arrays Next, we need to create two new arrays with the correct sizes. If the size of the old array is n, the sizes of the new arrays should be (n + 1)/2 and n/2. We define predicates div two and plus one div two that will construct the appropriate expressions, given the original array size. Since the expression that computes the original array size might have side effects or be quite complicated, we first introduce a new local variable to hold the value of this expression so we do not repeat its computation. Thus, we need to

construct three new vertices to replace InitVert – one to initialise the local variable, and two to initialise the new arrays. We omit the details – the only minor difficulty is that because @new vertex requires a list of outgoing edges, and constructing new edges requires the target vertex, we must construct the vertices in the reverse of the order they will appear in the graph.

6.2.5 Replacing the old array The next step is to exhaustively replace occurrences of the old array. For each occurrence that follows the newly inserted initialisations, we need to insert a dynamic test on the index that the old array was accessed with, and access one of the two new arrays depending on the result of that dynamic test. Finding occurrences is straightforward – we just define a predicate array occurs(A, I , E ) which looks for any occurrence of the form A[I ] at E , and then search for nodes occurring after the initialisation of the second new array (whose vertex is bound to SecondInitVert): find old array(OldArray, SecondInitVert, Index , OccursVert) :− exists ( { }∗ ; { ′ array occurs(OldArray, Index ) } ) (SecondInitVert, OccursVert). Using the newly bound value for Index , we can then construct the necessary nodes with which to replace OccursVert – again, we omit details.

7. RELATED WORK This paper is a contribution to the extensive literature on specifying program analyses and transformations in a declarative style. In this section, we shall survey only the most closely related work, which directly prompted the results presented here. The APTS system of Paige [26] has been a major source of inspiration for the work presented here. In APTS, program transformations are expressed as rewrite rules, with side conditions expressed as boolean functions on the abstract syntax tree, and data obtained by program analyses. These functions are in some ways similar to those presented here, but we have gone a little further in embedding them in a variant of Prolog. By contrast, in APTS the analyses are coded by hand, and this is the norm in similar systems that have been constructed in the high-performance computing community, such as MT1 [4], which is a tool for restructuring Fortran programs. Another difference is that both these systems transform the tree structure rather than the flow graph, as we do. We learnt about the idea of taking graph rewriting as the basis of a transformation toolkit from the work of Assmann [3] and of Whitfield and Soffa [32]. Assmann’s work is based on a very general notion of graph rewriting, and it is of a highly declarative nature. It is a little difficult, however, to express properties of program paths. Whitfield and Soffa’s system does allow the expression of such properties,

through a number of primitives that are all specialisations of our predicate all P (S , T ). Our main contribution relative to their work is that generalisation, and its embedding in Prolog. The use of path patterns (for examining a graph structure) that contain logical variables was borrowed from the literature on querying semi-structured data [6]. There, only existential path patterns are considered. So relative to that work, our contribution is to explore the utility of universal path patterns as well. Liu and Yu [20] have derived algorithms for solving regular path queries using both predicate logic and language inclusion, although their patterns do not (yet) support free variables, which for our purposes are essential for applying transformations. The case for using Prolog as a vehicle for expressing program analyses was forcefully made in [11]. The same research team went on to embed model checkers in Prolog [10], and indeed that was our inspiration for implementing the all and exists primitives through tabling. Kris De Volder’s work taught us about the use of Prolog for applying program transformations [13].

8.

DISCUSSION

We have presented a modest extension of Prolog, namely path logic programming. The extension is modest both in syntactic and semantic terms: we only introduced two new forms of predicate, and it is easy to compile these to standard Prolog primitives (the resulting program uses tabling to conduct a depth-first search of the product automaton of the determinised pattern and the flow graph [12]). We could have chosen to give our specifications in terms of modal logic instead of regular expressions, and indeed such modal logic specifications were the starting point of this line of research [19]. We believe that regular expressions are a little more convenient, but this may be due to familiarity, and a formal comparison of the expressivity of the alternate styles would be valuable.

8.1 Shortcomings There are a number of problems that need to be overcome before the ideas presented here can be applied in practice: • Our transformations need to take account of aliasing. We have recently updated our implementation to add Prolog primitives which conduct an alias analysis on demand – an alternative approach would be to annotate the graph with alias information. • For efficiency, it would be desirable that the Prolog interpreter proceeds incrementally, so that work done before a transformation can be re-used afterwards. Our current implementation is oblivious of previous work, and therefore very inefficient. We are currently developing an algorithm that should address this concern and hope to be able to report results soon. • To deal with mutually recursive procedures, it would be more accurate to describe paths in the program through a context-free grammar, instead of a traditional flow graph [23]. We are now engaged in the implementation of a system that addresses all three of these problems. The main idea is to

compile logical goals of the form all Q (S , T ) to a standard meet-over-all-paths analysis [1]. This allows a smooth integration with standard data flow analyses, and also a re-use of previous work on incremental computation of such analyses. It does however require us to make the step from standard logic programming to Prolog with inequality constraints. As with all programming languages, path logic programming would greatly benefit by a static typing discipline. In our experience, most programming errors come from building up syntactically wrong EIL terms – this could easily be caught with a type system such as that of Mercury [27]. In fact, as Mercury already has a .NET implementation [14], there is an argument for integrating our new features into that language rather than standard Prolog. Our current implementation language is Standard ML. In transformations that involve scope information, it is tricky to keep track of the exact binding information, and the possibility of variable capture. The framework should provide support for handling these difficulties automatically. This difficulty has been solved by higher-order logic programming [24], and we hope to integrate those ideas with our own work at a later stage. Transformation side-conditions can quickly become quite complex, and the readability of the resulting description in Prolog is something of a concern. We hope that judicious modularisation of the Prolog programs and appropriate use of our strategy language to separate different transformations addresses this issue, but more experience will be required to determine whether this actually works in practice with large sets of complex transformations.

8.2 Other applications Path logic programming might have other applications beyond those presented here. For example, in aspect-oriented programming one needs to specify sets of points in the dynamic call graph [18]. Current language proposals for the specification of such pointcuts are somewhat ad hoc, and path logic programming could provide a principled alternative. Furthermore, we could then use the transformation technology presented here to achieve the compile-time improvement of aspect-oriented programs [21]. Another possible application of path logic programming is static detection of bugs: the path queries could be used to look for suspect patterns in code. Indeed, the nature of the patterns in Dawson Engler’s work on bug detection [7] is very similar to those presented here. Again we are not the first to note that a variant of logic programming would be well-suited to this application area: Roger Crew designed the language ASTLOG for inspection of annotated abstract syntax trees [9]. Our suggestion is that his work be extended to the inspection of program paths, thus making the connection with the extensive literature on software model checking.

8.3 Acknowledgements We would like to thank our colleagues in the Programming Tools Research Group at Oxford for many enjoyable discussions on the topic of this paper. We would also like to thank David Lacey, Krzysztof Apt, Hidehiko Masuahara, Paul Kwiatkowski and the three anonymous PPDP reviewers for their many helpful comments. Stephen Drape and Ganesh Sittampalam are supported by a research grant from Microsoft Research.

9.

REFERENCES

[1] Aho, A., Sethi, R., and Ullman, J. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1985. [2] Archer, T. Inside C#. Microsoft Press, 2001. [3] Assmann, U. How to specify program analysis and transformation with graph rewrite systems. In Compiler Construction 1996 (1996), P. Fritzson, Ed., vol. 1060 of Lecture Notes in Computer Science, pp. 372–395. [4] Bik, A. J. C., Brinkhaus, P. J., Knijnenburg, P. M. W., and Wijshoff, H. A. G. Transformation mechanisms in MT1. Technical report, Leiden Institute of Advanced Computer Science, 1998. ´ , P., Kirchner, C., Kirchner, H., [5] Borovansky and Ringeissen, C. Rewriting with strategies in ELAN: a functional semantics. International Journal of Foundations of Computer Science 12, 1 (2001), 69–98. Also available as Technical Report A01-R-388, LORIA, Nancy (France). [6] Buneman, P., Fernandez, M., and Suciu, D. UnQL: A query language and algebra for semistructured data based on structural recursion. VLDB Journal 9, 1 (2000), 76–110. [7] Chou, A., Chelf, B., Hallam, S., and Engler, D. Checking system rules using system-specific, programmer-written compiler extensions. In USENIX Symposium on Operating System Design and Implementation (2000). [8] Collberg, C., Thomborson, C., and Low, D. A taxonomy of obfuscating transformations. Tech. Rep. 148, Department of Computer Science, University of Auckland, July 1997. Available from URL: http://www.cs.auckland.ac.nz/~collberg/ Research/Publications/CollbergTh% omborsonLow97a/index.html. [9] Crew, R. F. ASTLOG: a language for examining abstract syntax trees. In Proceedings of the USENIX Conference on Domain-Specific Languages (1997), C. Ramming, Ed., pp. 229–242. [10] Cui, B., Dong, Y., Du, X., Kumar, K. N., Ramakrishnan, C. R., Ramakrishnan, I. V., Roychoudhury, A., Smolka, S. A., and Warren, D. S. Logic programming and model checking. In PLILP/ALP (1998), pp. 1–20. [11] Dawson, S., Ramakrishnan, C. R., and Warren, D. S. Practical program analysis using general purpose logic programming systems - a case study. In SIGPLAN Conference on Programming Language Design and Implementation (1996), pp. 117–126. [12] de Moor, O., Lacey, D., and Van Wyk, E. Universal regular path queries. Higher-order and symbolic computation (2002), to appear. [13] de Volder, K. Type-oriented logic meta-programming. PhD dissertation, Vrije Universiteit Brussels, 1998. [14] Dowd, T., Henderson, F., and Ross, P. Compiling Mercury to the .NET common language runtime. In BABEL ’01: First International Workshop on Multi-Language Infrastructure and Interoperability (To appear, 2002), vol. 59 of Electronic Notes in Theoretical Computer Science.

[15] ECMA. Standard ECMA-335: Common language infrastructure (CLI). Available from http://www.ecma.ch/ecma1/STAND/ecma-335.htm. [16] Felty, A. A logic programming approach to implementing higher-order term rewriting. In Proceedings of the January 1991 Workshop on Extensions to Logic Programming (1992), L.-H. Eriksson, L. Halln¨ as, and P. Schroeder-Heister, Eds., vol. 596, Springer-Verlag Lecture Notes in Artificial Intelligence, pp. 135–161. [17] Gunnerson, E. A programmer’s introduction to C#. APress, 2001. [18] Kiczales, G., Lamping, J., Menhdhekar, A., Maeda, C., Lopes, C., Loingtier, J.-M., and Irwin, J. Aspect-oriented programming. In Proceedings European Conference on Object-Oriented Programming, M. Ak¸sit and S. Matsuoka, Eds., vol. 1241. Springer-Verlag, Berlin, Heidelberg, and New York, 1997, pp. 220–242. [19] Lacey, D., and de Moor, O. Imperative program transformation by rewriting. In Proceedings of the 10th International Conference on Compiler Construction (2001), R. Wilhelm, Ed., vol. 2027 of LNCS, Springer Verlag, pp. 52–68. [20] Liu, Y. A., and Yu, F. Solving regular path queries. In Proceedings of the 6th International Conference on Mathematics of Program Construction (2002), pp. 195–208. [21] Masuhara, H., Kiczales, G., and Dutchyn, C. Compilation semantics of aspect-oriented programs. In Proceedings of the AOSD workshop on Foundations of Aspect-oriented Languages (2002), vol. TR-02-06 of Technical report, Department of Computer Science, Iowa State University, pp. 17–25. [22] Meijer, E., and Gough, J. A technical overview of the commmon language infrastructure. Available from URL: http://research.microsoft.com/~emeijer/ Papers/CLR.pdf. [23] Melski, D., and Reps, T. Interconvertibility of a class of set constraints and context-free language reachability. Theoretical Computer Science, 248 (2000), 29–98. [24] Nadathur, G., and Miller, D. Higher-order logic programming. In Handbook of logics for artificial intelligence and logic programming, D. M. Gabbay, C. Hogger, and J. A. Robinson, Eds., vol. 5. Clarendon Press, 1998, pp. 499–590. [25] Orgun, M. A., and Ma, W. An overview of temporal and modal logic programming. In Proceedings of ICTL’94: The 1st International Conference on Temporal Logic (Berlin Heidelberg, 1994), D. M. Gabbay and H. J. Ohlbach, Eds., Springer-Verlag, pp. 445–479. [26] Paige, R. Viewing a program transformation system at work. In Proceedings of the Sixth International Symposium on Programming Language Implementation and Logic Programming (1994), M. Hermenegildo and J. Penjam, Eds., Springer Verlag, pp. 5–24. [27] Somogyi, Z., Henderson, F., and Conway, T. Mercury: an efficient purely declarative logic programming language. In Proceedings of the

[28] [29]

[30]

[31]

[32]

Australian Computer Science Conference (1995), pp. 499–512. Spivey, M. Logic Programming for Programmers. Prentice Hall, 1996. Vallee-Rai, R., Gagnon, E., Hendren, L. J., Lam, P., Pominville, P., and Sundaresan, V. Optimizing Java bytecode using the Soot framework: Is it feasible? In Proceedings of the International Conference on Compiler Construction (2000), pp. 18–34. Vallee-Rai, R., Hendren, L., Sundaresan, V., Lam, P., Gagnon, E., and Co, P. Soot – a Java optimization framework. In Proceedings of CASCON 1999 (1999), pp. 125–135. Visser, E. Stratego: A language for program transformation based on rewriting strategies. System description of Stratego 0.5. In Rewriting Techniques and Applications (RTA’01) (May 2001), A. Middeldorp, Ed., no. 2051 in Lecture Notes in Computer Science, Springer-Verlag, pp. 357–362. Whitfield, D. L., and Soffa, M. L. An approach for exploring code improving transformations. ACM Transactions on Programming Languages and Systems 19, 6 (1997), 1053–1084.

Suggest Documents