
Proof-Directed De-compilation of Java Bytecode

Shin-ya Katsumata    Atsushi Ohori

Research Institute for Mathematical Sciences, Kyoto University, Sakyo-ku, Kyoto 606-8502, Japan
E-mail: {sinya,[email protected]}

Abstract

We present a proof system for the Java bytecode language based on a Curry-Howard isomorphism for machine code, where an executable code is regarded as a proof of a variant of a sequent calculus of the intuitionistic propositional logic. Different from the type systems for Java bytecode proposed so far, our proof system not only specifies type consistency but also represents the computation denoted by a bytecode program. This allows us to reconstruct a proof of the intuitionistic propositional logic from any (logically consistent) bytecode program. Being a proof of the intuitionistic logic, the reconstructed proof can be translated to an equivalent proof in the natural deduction proof system. This property yields an algorithm to translate a given bytecode program into a lambda term. Moreover, the resulting lambda term is not a trivial encoding of a sequence of machine instructions, but reflects the behavior of the given bytecode program. This process therefore serves as proof-directed de-compilation of the Java bytecode language to a high-level language. We carry out this program for a subset of Java Virtual Machine instructions including most of its features, such as jumps, object creation and method invocation. The proof-directed de-compilation algorithm has been implemented, which demonstrates the feasibility of our approach.

1 Introduction

The ability to analyze compiled code before its execution has become increasingly important due to the recent emergence of network computing, where pieces of executable code are dynamically exchanged over the Internet and used under the user's own privileges. In such a circumstance, it is a legitimate desire to verify that a foreign code satisfies some desired properties before it is executed. This problem has recently attracted the attention of programming language researchers. One notable approach toward verification of properties of compiled code is to construct a formal proof of certain desired properties of a code using a theorem prover, and to package the code with its formal proof to form a proof-carrying code [11]. The user of the code can then check the attached proof against the code to ensure that the code satisfies the properties. Another approach is to develop a static type system for compiled code [17, 3, 10, 9]. By checking the type consistency of a code, the user can ensure that the code does not cause any run-time type error. Both approaches are effective in verifying certain predetermined safety properties. In many cases, however, the user would like to know the actual behavior of a foreign code to ensure that the code correctly realizes its expected functionality. Moreover, analysis of the exact behavior of a code would open up new possibilities in network computing, where foreign code can be dynamically analyzed and adapted to suit the user's environment. An executable code, whether it is a bytecode for an abstract machine or a native code for some specific hardware, is a large sequence of primitive instructions, and it is certainly impossible for a human to understand. However, this does not mean, at least in principle, that a code is incomprehensible. A properly compiled code of a correct program is a consistent syntactic object that can be interpreted by a machine to perform the function denoted by the original program. This fact suggests that it should be possible to develop a systematic method to extract the logical structure of a code and to present it in a high-level and human-readable language.

(*) The second author's work was partly supported by the Parallel and Distributed Processing Research Consortium, Japan.


The purpose of this paper is to show that the development of such a system is indeed possible.

The key to developing a system of code analysis is a Curry-Howard isomorphism for machine code presented in [14]. In this paradigm, a code language corresponds to a variant of the sequent calculus of the intuitionistic propositional logic, called the sequential sequent calculus. Being a proof system of the logic, it can be translated to and from other languages corresponding to proof systems of the intuitionistic logic. Compilation corresponds to a proof transformation from the natural deduction into the sequential sequent calculus. It is also shown that the converse is equally possible, leading to proof-directed de-compilation of machine code. Since the translation is done through a universal language, namely the intuitionistic propositional logic, this approach is applicable to a wide range of code languages and high-level languages. As long as we can define a constructive proof system corresponding to a code language, it can be translated to other languages corresponding to the intuitionistic propositional logic, such as the typed lambda calculus and A-normal forms [2, 13].

In the present paper, we show that this approach leads to a realistic code analyzer by developing a code analysis framework for the Java bytecode language specified in [7], which is the target language of Java [6]. The rationale of this choice is twofold. On one hand, since the Java bytecode language is intended to serve as a universal code language for network computing, its static analysis is important in itself. On the other hand, the Java bytecode language provides several modern features, such as objects and methods, at the bytecode level, and serves as a good touchstone for the feasibility of the logical approach we are advocating. In this paper, we consider a subset of Java Virtual Machine (JVM) instructions including those for primitive operations on integers, stack manipulation, local variable access, conditional and unconditional jumps, object creation and initialization, and method invocation. We believe that this subset is large enough to demonstrate the feasibility of our method. Among these features, manipulation of objects is inherently imperative and requires extra-logical structure. For this reason, we divide our technical development into two stages. We first define a JVM assembly language without objects, which we denote here by JAL0, and develop a code analysis system for JAL0. We then extend JAL0 with objects and methods and introduce the necessary mechanisms for extending the code analysis system to the imperative object manipulation instructions. Based on these results, we have implemented a proof-directed de-compiler for the Java bytecode language supporting most of its features.

To illustrate the feasibility of the proof-directed de-compilation we shall present in this paper, let us show an example of de-compilation. Figure 1 shows an actual output of our de-compiler. The source bytecode is the result of compiling the following factorial program written in Java.

    public static int fact (int n) {
      int m;
      for (m = 1; n > 1; --n) m = m * n;
      return m;
    }

As seen in this example, our de-compiler correctly recovers the factorial program as a tail recursive function without using any knowledge other than the given bytecode. Without a proper formalism, construction of such a de-compiler would be difficult, if not impossible. We shall include a description of the implementation after we present the logical framework for de-compilation.

Related Work. Before giving the technical development, we compare the results presented in this paper with related work. The work most relevant to ours is that of Stata and Abadi [17], who presented a type system for a subset of the Java bytecode language including subroutines. This work was further refined by O'Callahan [12] and Freund and Mitchell [3]. In these approaches, a type system is used to check the consistency of an array of instructions. The result of typechecking is success or failure, indicating whether the array of instructions is type consistent or not. In contrast, our approach is to interpret a given code as a constructive proof representing its computation. This allows us to de-compile a code to a lambda term. In addition to the consistency checking of instructions, the type system of Freund and Mitchell [3] contains a mechanism to prevent uninitialized objects from being accessed. This mechanism has a generality that makes it applicable to other formal systems; we adopt it for type safe manipulation of object creation in our formalism. Our work is also related to the typed assembly language (TAL) of Morrisett et al. [10, 9]. Their type system is designed to check whether a collection of assembly instructions is type consistent or not, and is not intended to serve as a logic.

.method public static fact(I)I
     iconst_1
     istore_1
     goto L12
L5:  iload_1
     iload_0
     imul
     istore_1
     iinc 0 -1
L12: iload_0
     iconst_1
     if_icmpgt L5
     iload_1
     ireturn
.end method

An input to the de-compiler.

fact(e0)     = L12(e0, 1)
L12(e0, e1)  = if (1 < e0) then L12(e0 - 1, e1 * e0) else return e1

An actual output of our de-compiler.

Figure 1: An example of the proof-directed de-compilation

Nonetheless, some of our proof rules are similar to the corresponding ones in their type systems. In our proof theory, a jump instruction is interpreted as a rule that refers to an existing proof, and is intuitively understood as an edge connecting two proofs in a code program consisting of a graph of basic blocks of instructions. This idea has some similarity to TAL's treatment of jump instructions: a jump takes as its argument a pointer to other instructions (i.e., proofs) in the heap area. Our de-compilation performs proof translation from a variant of the sequent calculus to the natural deduction. From the perspective of logic, the relationship between Gentzen's intuitionistic sequent calculus and the natural deduction has been extensively studied [5, 16, 19, 15]. (See also [4] for an illuminating tutorial on this subject.) Our proof system for bytecode languages is similar to Gentzen's sequent calculus, especially in its treatment of structural rules, and therefore some of the cases in the de-compilation algorithm have a structure similar to the corresponding cases in a proof transformation algorithm from Gentzen's sequent calculus to the natural deduction. There are also a number of works on "reverse engineering" machine code. (See, for example, the proceedings of the working conference on reverse engineering [18].) While our work is based entirely on logical principles and does not use any of those techniques, we believe that some of their techniques would be useful when we want to consider existing programming languages such as C and Java as the target of our de-compilation.

Paper Organization. In Section 2, we explain the Curry-Howard isomorphism for machine code and outline the proof-directed de-compilation. Section 3 defines a proof system for a term language JAL0 for the subset of Java bytecode without objects, shows the soundness of the proof system with respect to the operational semantics of JAL0, and develops an algorithm which translates a JAL0 program to an equivalent lambda term. Section 4 extends this formalism with objects and methods. Section 5 describes the implementation of our prototype de-compiler. Section 6 concludes the paper.

2 Logical approach to code analysis

In this section, we outline the Curry-Howard isomorphism for machine code presented in [14], on which our proof-directed de-compilation of Java bytecode is based. We let Δ range over a list of formulas representing an assumption set of a logical sequent, and write Δ; τ for the list obtained from Δ by adding τ. The basic observation underlying the logical interpretation of machine code is to consider each instruction I as a logical inference step of the form

    Δ1 ⊢ τ
    ----------- I
    Δ2 ⊢ τ

which is similar to a left rule with only one premise in Gentzen's sequent calculus. Regarding the assumption set as a description of machine memory (stack), an inference rule of this form represents a primitive machine instruction that transforms a memory state Δ2 to that of Δ1. The "return" instruction corresponds to an initial sequent of the form Δ; τ ⊢ τ.

A bytecode language for a functional language is represented by a proof system consisting of rules of this form, called a sequential sequent calculus, which is sound and complete with respect to the intuitionistic propositional logic. One important implication of this result is that a bytecode language can be translated to and from other proof systems of the intuitionistic propositional logic. Compilation is one such translation, which transforms a proof of the natural deduction into this proof system. Moreover, it is shown that the converse is also possible. The lambda term obtained through the inverse transformation is non-trivial and exhibits the logical structure of the code. This process can therefore be regarded as de-compilation.

Below, we outline the proof-directed de-compilation using a simple bytecode language. The set of instructions (ranged over by I) and code blocks (ranged over by B) of the bytecode language are given below.

    B ::= return | I; B
    I ::= iconst n | acc i | pair | fst | snd | code B | call n

Instructions iconst n, acc i, and code B push onto the stack the integer value n, the ith element of the stack, and a pointer to the code block B, respectively. Instructions pair, fst and snd are stack operations manipulating pairs. Call n pops n elements and a code pointer off the stack, and calls the code with the n arguments.

We consider a list Δ of types as the type of a machine stack, with the convention that the right-most formula in a list corresponds to the top of the stack, and interpret a block of this bytecode language as a proof of the sequential sequent calculus. We write |Δ| for the size of Δ and write Δi for the ith element in Δ. The term assignment system for the sequential sequent calculus is given in Figure 2. The intended semantics of each instruction should be understood by reading the corresponding proof rule "backward". For example, pair changes the stack state from Δ; τ1; τ2 to Δ; τ1 × τ2, indicating the operation that replaces the top-most two elements with their product.

    Δ; τ ⊢ return : τ

    Δ; int ⊢ B : τ
    -----------------------
    Δ ⊢ iconst n; B : τ

    Δ; Δi ⊢ B : τ
    -----------------------
    Δ ⊢ acc i; B : τ

    Δ; τ1 × τ2 ⊢ B : τ
    -----------------------
    Δ; τ1; τ2 ⊢ pair; B : τ

    Δ; τ1 ⊢ B : τ
    -----------------------
    Δ; τ1 × τ2 ⊢ fst; B : τ

    Δ; τ2 ⊢ B : τ
    -----------------------
    Δ; τ1 × τ2 ⊢ snd; B : τ

    Δ; (Δ′ ⇒ τ′) ⊢ B : τ
    ----------------------- (if Δ′ ⊢ B′ : τ′)
    Δ ⊢ code B′; B : τ

    Δ; τ′ ⊢ B : τ
    ------------------------------------ (if |Δ′| = n)
    Δ; (Δ′ ⇒ τ′); Δ′ ⊢ call n; B : τ

Figure 2: A sequential sequent calculus for a simple bytecode language

We consider the following typed lambda calculus as the target of de-compilation.

    τ ::= int | τ → τ | τ × τ
    M ::= n | x | λx.M | M M | (M, M) | fst(M) | snd(M)

Its type system is the standard one. We write [M/x]N for the term obtained from N by substituting M for x (based on the usual capture-avoiding substitution for lambda terms). We write [[Δ ⊢ C : τ]] for the lambda term obtained by translating a derivation Δ ⊢ C : τ. The general idea behind the translation is to assign a name to each element in the assumption list Δ, and to proceed by induction on the derivation of the code. We write s_i for the name of the ith element in the list (counting from the left, starting with 0). The lambda term corresponding to the axiom is the name of the formula. For a compound proof, the algorithm first obtains the term corresponding to its sub-proof. It then applies the transformation corresponding to the first instruction to obtain the desired lambda term. The following equations define the transformation.

    [[Δ; τ ⊢ return : τ]]     = s_|Δ|
    [[Δ ⊢ acc i; B : τ]]      = [s_i / s_|Δ|] [[Δ; Δi ⊢ B : τ]]
    [[Δ ⊢ iconst n; B : τ]]   = [n / s_|Δ|] [[Δ; int ⊢ B : τ]]

    [[Δ; τ1; τ2 ⊢ pair; B : τ]]            = [(s_|Δ|, s_{|Δ|+1}) / s_|Δ|] [[Δ; τ1 × τ2 ⊢ B : τ]]
    [[Δ; τ1 × τ2 ⊢ fst; B : τ]]            = [fst(s_|Δ|) / s_|Δ|] [[Δ; τ1 ⊢ B : τ]]
    [[Δ; τ1 × τ2 ⊢ snd; B : τ]]            = [snd(s_|Δ|) / s_|Δ|] [[Δ; τ2 ⊢ B : τ]]
    [[Δ ⊢ code B′; B : τ]]                 = [λs_0. ⋯ λs_{|Δ′|-1}. [[Δ′ ⊢ B′ : τ′]] / s_|Δ|] [[Δ; (Δ′ ⇒ τ′) ⊢ B : τ]]
    [[Δ; (Δ′ ⇒ τ′); Δ′ ⊢ call n; B : τ]]   = [(s_|Δ| s_{|Δ|+1} ⋯ s_{|Δ|+n}) / s_|Δ|] [[Δ; τ′ ⊢ B : τ]]    (n = |Δ′|)
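To make these equations concrete, the following OCaml fragment sketches the translation for the simple bytecode language. It is an editorial illustration and not the authors' implementation: all names are chosen here, lambda terms are built as plain strings, and the code instruction is annotated with the arity of its block, information that in the paper is supplied by the type derivation rather than by the syntax. The symbolic stack holds the term currently denoting each slot, which realizes the substitutions [M/s_i] directly.

  (* Simple bytecode of Section 2; Code carries the arity of its block (an assumption). *)
  type instr =
    | Iconst of int
    | Acc of int                 (* push the i-th stack element, counted from the bottom *)
    | Pair | Fst | Snd
    | Code of int * block
    | Call of int
  and block =
    | Return
    | Seq of instr * block

  let fresh = let c = ref 0 in fun () -> incr c; Printf.sprintf "x%d" !c
  let rec take n l = if n = 0 then [] else List.hd l :: take (n - 1) (List.tl l)
  let rec drop n l = if n = 0 then l else drop (n - 1) (List.tl l)

  (* The stack is a list of terms with its head on top; each case mirrors one equation. *)
  let rec trans (stack : string list) (b : block) : string =
    match b with
    | Return -> List.hd stack
    | Seq (ins, rest) ->
        (match ins with
         | Iconst n -> trans (string_of_int n :: stack) rest
         | Acc i ->
             (* 1-based indexing from the bottom of the stack is an assumption of this sketch *)
             trans (List.nth stack (List.length stack - i) :: stack) rest
         | Pair ->
             (match stack with
              | t2 :: t1 :: s -> trans (Printf.sprintf "(%s, %s)" t1 t2 :: s) rest
              | _ -> failwith "stack underflow")
         | Fst -> trans (Printf.sprintf "fst(%s)" (List.hd stack) :: List.tl stack) rest
         | Snd -> trans (Printf.sprintf "snd(%s)" (List.hd stack) :: List.tl stack) rest
         | Code (k, b') ->
             let args = List.init k (fun _ -> fresh ()) in
             let body = trans (List.rev args) b' in
             let lam  = List.fold_right (fun x m -> Printf.sprintf "fun %s -> %s" x m) args body in
             trans (("(" ^ lam ^ ")") :: stack) rest
         | Call n ->
             let args = List.rev (take n stack) in
             let f    = List.nth stack n in
             let app  = List.fold_left (fun m a -> Printf.sprintf "(%s %s)" m a) f args in
             trans (app :: drop (n + 1) stack) rest)

  (* For instance, trans [] (Seq (Iconst 1, Seq (Iconst 2, Seq (Pair, Seq (Fst, Return)))))
     evaluates to the string "fst((1, 2))". *)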

The above algorithm provides a de-compilation algorithm for a simple bytecode language. However, significantly more work is needed to apply this approach to the Java bytecode language, to which we now turn.

3 JAL0: the JVM assembly language without objects

Compared to the simple bytecode language considered above, the Java bytecode language has the following additional features.

1. Restricted stack access and local variable support. JVM does not include an instruction to access an arbitrary stack element. This restriction is compensated by JVM's support for directly accessible mutable local variables.

2. Labels and jumps. As in most existing computer architectures, JVM uses labels and jumps to realize control flow.

3. Classes, objects and methods. To support the object-oriented features of Java, the JVM architecture is based on methods and objects, instead of functions and records. Moreover, objects in JVM involve cyclic structure and mutability.

Since the third feature, objects, requires significantly new machinery, we divide our development into two stages. In this section, we define a JVM assembly language without objects, denoted by JAL0, supporting the first two features, and present its proof system and the proof-directed de-compilation algorithm for JAL0. Later, in Section 4, we describe the necessary extensions for objects and methods.

3.1 Syntax of the language

With the introduction of labels and jumps, an executable program unit is no longer a sequence of instructions, but a collection of labeled blocks, mutually referenced through labels. We let l range over labels. The syntax of program units (ranged over by K), blocks (ranged over by B) and instructions (ranged over by I) of JAL0 is given below.

    K ::= { l : B; ⋯ ; l : B }
    B ::= ireturn | goto l | I; B
    I ::= iconst n | pop | dup | swap | iload i | istore i | iadd | ifzero l

Ireturn is for returning an integer value. Goto l transfers control to the block labeled l. Iconst n is the same as before. Pop, dup and swap are the stack operations for popping the stack, for duplicating the top element, and for swapping the top two elements, respectively. Iload i pushes the contents of the local variable i onto the stack. Istore i pops the top value off the stack and stores it in the local variable i. Iadd pops two integers off the stack and pushes back their sum. Ifzero l pops the top element off the stack, and if it is 0, transfers control to the code block labeled l. A JAL0 program is a program unit K with its entry label l,

written K:l. The following example is a JAL0 program, which takes an integer input through local variable i0 and returns 1 if it is 0, and returns 0 otherwise.

    K0:l ≡ { l  : iload i0; ifzero l′; iconst 0; goto l″ ;
             l′ : iconst 1; goto l″ ;
             l″ : ireturn } : l
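As an illustration (not part of the paper's formal development), the JAL0 syntax and the example unit K0 can be represented by datatypes such as the following; the constructor names and the list-based representation of program units are assumptions of this sketch.

  type label = string

  type instr =
    | Iconst of int
    | Iadd
    | Pop | Dup | Swap
    | Iload of int
    | Istore of int
    | Ifzero of label

  type block =
    | Ireturn
    | Goto of label
    | Seq of instr * block

  type unit_   = (label * block) list      (* a program unit K *)
  type program = unit_ * label             (* K : l, a unit with its entry label *)

  (* The example unit K0: returns 1 if local 0 is zero, and 0 otherwise. *)
  let k0 : program =
    ( [ "l",   Seq (Iload 0, Seq (Ifzero "l'", Seq (Iconst 0, Goto "l''")));
        "l'",  Seq (Iconst 1, Goto "l''");
        "l''", Ireturn ],
      "l" )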

3.2 The proof system for JAL0

Each JAL0 instruction operates on the stack and the local variables, and is represented as a proof rule of the form

    Γ; Δ ⊢ B : τ
    -----------------------
    Γ′; Δ′ ⊢ I; B : τ

where Γ is a local variable context, which is a function from a finite set of local variables (ranged over by i) to value types, Δ is a stack context, which is a finite sequence of types, and τ is a value type of the block B. We use the notation Γ{i : τ} for the function Γ′ such that dom(Γ′) = dom(Γ) ∪ {i}, Γ′(i) = τ, and Γ′(i′) = Γ(i′) for any i′ ∈ dom(Γ), i′ ≠ i. In the real JVM, a value type is either some primitive type or an object type (denoting a reference to an object). In the JAL0 considered in this section, the only possible value type is "int" of integers. Later, in Section 4, where we extend JAL0, we include object types.

Since a block in general refers to other blocks through labels, the typing of a block is determined relative to an assumption on the types of blocks assigned to the labels. We define a block type to be a logical sequent of the form Γ; Δ ⊢ τ denoting possible blocks B such that Γ; Δ ⊢ B : τ. We let Λ range over label contexts, which are mappings from a finite set of labels to block types. The type system of JAL0 is given as a proof system to derive the following forms of judgments.

    Λ ⊢ Γ; Δ ⊢ B : τ     block B has block type Γ; Δ ⊢ τ under Λ
    ⊢ K : Λ              program unit K is typed with Λ
    ⊢ K:l : Γ ⇒ τ        program K:l has program type Γ ⇒ τ

A program unit K is a collection of labeled blocks, and is typed with a label context Λ. Since the structure of blocks is fixed, Λ is fixed for the entire program. For this reason, we make Λ implicit in each typing rule, and simply write Γ; Δ ⊢ B : τ. The type system defining these relations is given in Figure 3.

Typing rules for blocks:

    Γ; Δ; int ⊢ ireturn : int

    Γ; Δ ⊢ goto l : τ    (if Λ(l) = Γ; Δ ⊢ τ)

    Γ; Δ ⊢ B : τ′
    -------------------------
    Γ; Δ; τ ⊢ pop; B : τ′

    Γ; Δ; τ; τ ⊢ B : τ′
    -------------------------
    Γ; Δ; τ ⊢ dup; B : τ′

    Γ; Δ; τ′; τ ⊢ B : τ″
    ----------------------------
    Γ; Δ; τ; τ′ ⊢ swap; B : τ″

    Γ; Δ; int ⊢ B : τ′
    ----------------------------
    Γ; Δ ⊢ iconst n; B : τ′

    Γ; Δ; int ⊢ B : τ
    ---------------------------- (if Γ(i) = int)
    Γ; Δ ⊢ iload i; B : τ

    Γ{i : int}; Δ ⊢ B : τ
    ----------------------------
    Γ; Δ; int ⊢ istore i; B : τ

    Γ; Δ; int ⊢ B : τ
    ----------------------------
    Γ; Δ; int; int ⊢ iadd; B : τ

    Γ; Δ ⊢ B : τ
    ---------------------------- (if Λ(l) = Γ; Δ ⊢ τ)
    Γ; Δ; int ⊢ ifzero l; B : τ

Typing of program units:

    Λ ⊢ Γ; Δ ⊢ K(l) : τ  such that Λ(l) = Γ; Δ ⊢ τ,  for each l ∈ dom(Λ)
    -----------------------------------------------------------------------
    ⊢ K : Λ

Typing of programs:

    ⊢ K : Λ    Λ(l) = Γ; ∅ ⊢ τ for some τ
    ----------------------------------------
    ⊢ K:l : Γ ⇒ τ

Figure 3: The Type System of JAL0

Some explanations on the type system are in order.

- The definition ⊢ K : Λ for typing a program unit K is essentially the same as the typing rule for recursive definitions in a functional language. Later we shall show that a program unit consisting of mutually referencing labeled blocks is de-compiled to mutually recursive functions.

- The rules for stack instructions pop, dup and swap correspond to logical rules for weakening, contraction and exchange, respectively. The rules for iload i and istore i can also be understood as contraction and weakening rules across two assumption sets, respectively. From this, one observes that a machine with restricted stack access corresponds to a logic with explicit structural rules.

- Conditional branch and jump instructions are, as mentioned at the beginning of this section, considered as rules referring to other blocks in a program unit. These rules require that the type of the referenced block has the same local variable context, the same stack context and the same return type as those of the reference point. This strict requirement is appropriate for ordinary branch instructions. As observed by Stata and Abadi [17], however, a more flexible constraint is required for JVM subroutines.

For example, the program unit K0 given in the previous subsection has the typing ⊢ K0 : Λ0, where

    Λ0 ≡ { l  : {i0 : int}; ∅ ⊢ int,
           l′ : {i0 : int}; ∅ ⊢ int,
           l″ : {i0 : int}; int ⊢ int }

and therefore ⊢ K0:l : {i0 : int} ⇒ int.
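A possible reading of the block judgment of Figure 3 as a checker is sketched below, assuming the datatypes of the previous sketch. It is illustrative only: contexts are association lists compared with structural equality, so a canonical ordering of Γ is assumed, and the label context Λ is supplied up front, as in the typing of program units.

  type vtype = Int
  type gamma = (int * vtype) list          (* Gamma, kept sorted by variable index *)
  type delta = vtype list                  (* Delta, rightmost element is the stack top *)
  type btype = gamma * delta * vtype       (* a block type  Gamma; Delta |- tau *)
  type lctx  = (label * btype) list        (* the label context Lambda *)

  (* Checks  Lambda |- Gamma; Delta |- b : tau  following Figure 3. *)
  let rec check (lam : lctx) (gamma : gamma) (delta : delta) (b : block) (tau : vtype) : bool =
    let top_rest d = match List.rev d with t :: r -> Some (t, List.rev r) | [] -> None in
    match b with
    | Ireturn -> (match top_rest delta with Some (Int, _) -> tau = Int | None -> false)
    | Goto l  -> List.assoc_opt l lam = Some (gamma, delta, tau)
    | Seq (ins, rest) ->
        (match ins, top_rest delta with
         | Iconst _, _             -> check lam gamma (delta @ [Int]) rest tau
         | Iadd, Some (Int, d1)    ->
             (match top_rest d1 with
              | Some (Int, d2) -> check lam gamma (d2 @ [Int]) rest tau
              | None -> false)
         | Pop, Some (_, d)        -> check lam gamma d rest tau
         | Dup, Some (t, _)        -> check lam gamma (delta @ [t]) rest tau
         | Swap, Some (t1, d1)     ->
             (match top_rest d1 with
              | Some (t2, d2) -> check lam gamma (d2 @ [t1; t2]) rest tau
              | None -> false)
         | Iload i, _              ->
             List.assoc_opt i gamma = Some Int && check lam gamma (delta @ [Int]) rest tau
         | Istore i, Some (Int, d) ->
             check lam (List.sort compare ((i, Int) :: List.remove_assoc i gamma)) d rest tau
         | Ifzero l, Some (Int, d) ->
             List.assoc_opt l lam = Some (gamma, d, tau) && check lam gamma d rest tau
         | _, _ -> false)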

3.3 Operational semantics of JAL0 and the type soundness

The language JAL0 is intended to model (a subset of) JVM bytecode. As such, we define its operational semantics by specifying the effect of each instruction on a machine state. A machine state is described by a triple (S, E, B) of a stack S, which is a sequence of run-time values, a local variable environment E, which is a mapping from a finite set of local variables to run-time values, and a current block B, where the left-most instruction is the next one to be executed. For JAL0, the possible run-time values (ranged over by v) are integers. We write K ⊢ (S, E, B) ⟶ (S′, E′, B′) if the configuration (S, E, B) is transformed to (S′, E′, B′) under the program unit K. Figure 4 gives the set of transformation rules. The reflexive transitive closure of this relation is denoted by K ⊢ (S, E, B) ⟶* (S′, E′, B′). A program K:l computes a value v under an initial local variable environment E, written E ⊢ K:l ⇓ v, if K ⊢ (∅, E, K(l)) ⟶* (v, ∅, ∅).

    K ⊢ (S; v, E, ireturn)        ⟶ (v, ∅, ∅)
    K ⊢ (S; v, E, pop; B)         ⟶ (S, E, B)
    K ⊢ (S; v, E, dup; B)         ⟶ (S; v; v, E, B)
    K ⊢ (S; v1; v2, E, swap; B)   ⟶ (S; v2; v1, E, B)
    K ⊢ (S, E, iconst n; B)       ⟶ (S; n, E, B)
    K ⊢ (S; n1; n2, E, iadd; B)   ⟶ (S; n1 + n2, E, B)
    K ⊢ (S, E, iload i; B)        ⟶ (S; v, E, B)          if E(i) = v
    K ⊢ (S; v, E, istore i; B)    ⟶ (S, E{i ↦ v}, B)
    K ⊢ (S; n, E, ifzero l; B)    ⟶ (S, E, K(l))          if n = 0
    K ⊢ (S; n, E, ifzero l; B)    ⟶ (S, E, B)             if n ≠ 0
    K ⊢ (S, E, goto l)            ⟶ (S, E, K(l))

Figure 4: Operational Semantics of JAL0

We show that the type system we defined for JAL0 is sound with respect to this operational semantics. We write ⊨ v : τ if v has type τ, and define the typing relations for environments and stacks as follows.

    E ⊨ Γ  ⟺  dom(E) = dom(Γ), and ⊨ E(i) : Γ(i) for all i ∈ dom(E)
    S ⊨ Δ  ⟺  |S| = |Δ|, and ⊨ Si : Δi for all 1 ≤ i ≤ |S|

Since JAL0 only contains integers, the typing relation for values is the trivial relation between integers and int, but it can easily be extended to other primitive types. We can then show the following.

Theorem 1 If ⊢ K : Λ, Λ ⊢ Γ; Δ ⊢ B : τ, E ⊨ Γ, S ⊨ Δ, and B ≠ ireturn, then K ⊢ (S, E, B) ⟶ (S′, E′, B′) such that Λ ⊢ Γ′; Δ′ ⊢ B′ : τ, E′ ⊨ Γ′, and S′ ⊨ Δ′ for some Γ′, Δ′.

The proof is by induction on the derivation of B. From this, we have the following.

Corollary 1 If E ⊨ Γ and ⊢ K:l : Γ ⇒ τ, then either E ⊢ K:l ⇓ v such that ⊨ v : τ, or the computation of the program K:l under E does not terminate.
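The rules of Figure 4 can be read directly as an interpreter. The sketch below, under the same assumed datatypes as the earlier sketches, follows them one for one, with the stack represented as a list whose head is the top element and the environment E as an association list from local indices to integers.

  let rec run (k : unit_) (s : int list) (e : (int * int) list) (b : block) : int =
    match b, s with
    | Ireturn, v :: _                  -> v
    | Goto l, _                        -> run k s e (List.assoc l k)
    | Seq (Iconst n, b'), _            -> run k (n :: s) e b'
    | Seq (Iadd, b'), n2 :: n1 :: s'   -> run k (n1 + n2 :: s') e b'
    | Seq (Pop, b'), _ :: s'           -> run k s' e b'
    | Seq (Dup, b'), v :: s'           -> run k (v :: v :: s') e b'
    | Seq (Swap, b'), v2 :: v1 :: s'   -> run k (v1 :: v2 :: s') e b'
    | Seq (Iload i, b'), _             -> run k (List.assoc i e :: s) e b'
    | Seq (Istore i, b'), v :: s'      -> run k s' ((i, v) :: List.remove_assoc i e) b'
    | Seq (Ifzero l, _), 0 :: s'       -> run k s' e (List.assoc l k)
    | Seq (Ifzero _, b'), _ :: s'      -> run k s' e b'
    | _                                -> failwith "stuck: ill-formed configuration"

  (* For instance, with k0 from the earlier sketch:
       let (unit0, entry) = k0 in
       run unit0 [] [ (0, 0) ] (List.assoc entry unit0)
     evaluates K0:l with local 0 bound to 0 and returns 1. *)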

3.4 Proof-directed de-compilation of JAL0

We now develop a proof-directed de-compilation algorithm from JAL0 to the lambda calculus. To do this, we need to account for jump instructions and local variables. Our strategy for dealing with labels and jump instructions is to translate a labeled block as a function taking the set of local variables described in Γ and the set of stack elements described in Δ as its arguments, and to translate a jump to a label as a tail call of the function corresponding to the label. As we mentioned earlier, manipulation of local variables can be modeled by structural rules across a local variable context and a stack context. Their mutability is reduced to introducing a new binding each time a value is stored to a variable, and therefore no additional mechanism is required. This treatment is analogous to implementing an imperative program by transforming it into static single assignment form [1]. Since the control flow of a program unit may have cycles in general, we take the lambda calculus with mutually recursive definitions given below as the target language of de-compilation.

    τ ::= int | τ → τ
    M ::= n | iadd(M, M) | x | λx.M | M M | ifzero M then M else M | rec {x = M; ⋯ ; x = M} in M

rec {x1 = M1; ⋯ ; xn = Mn} in M is a term M with mutually recursive definitions, where x1, …, xn may appear in M1, …, Mn and are bound in M. The typing rule for this construct is similar to the one for mutually recursive function definitions found in functional languages. We refer to this language as λrec.

To present our de-compilation algorithm, we introduce some definitions and notations. We assume that the set of lambda variables is the disjoint union of a countably infinite set of label variables indexed by labels, a countably infinite set of local variables indexed by natural numbers, and a countably infinite set of stack variables indexed by natural numbers. We use the metavariable l for label variables as well as labels. We write i_j for the local variable of index j, and s_i for the stack variable of index i. We assume a fixed linear order on the set of local variables, and denote the ordered sequence of the set of variables in dom(Γ) by i_1, …, i_|Γ|, where |Γ| is the number of elements in dom(Γ). We use the following notation for the application of f to the variables corresponding to all the assumptions in the contexts.

    apply(f, Γ, Δ) = f i_1 ⋯ i_|Γ| s_0 ⋯ s_{|Δ|-1}

For a context Γ; Δ of a JAL0 judgment, its translation (Γ; Δ)* in λrec is defined as follows.

    (Γ; Δ)*(z) = Γ(z)      if z ∈ dom(Γ)
    (Γ; Δ)*(z) = Δ_{j+1}   if z = s_j and 0 ≤ j < |Δ|

The translation of a block type to a type in λrec is given below.

    (Γ; Δ ⊢ τ)* = Γ(i_1) → ⋯ → Γ(i_|Γ|) → Δ_1 → ⋯ → Δ_|Δ| → τ

This definition naturally extends to label contexts. We write Λ* for the translation of Λ in λrec. With these preparations, the de-compilation algorithm for blocks is given by the following equations.

    [[Γ; Δ; int ⊢ ireturn : int]]     = s_|Δ|
    [[Γ; Δ; τ ⊢ pop; B : τ′]]         = [[Γ; Δ ⊢ B : τ′]]
    [[Γ; Δ; τ ⊢ dup; B : τ′]]         = [s_|Δ| / s_{|Δ|+1}] [[Γ; Δ; τ; τ ⊢ B : τ′]]
    [[Γ; Δ; τ; τ′ ⊢ swap; B : τ″]]    = [s_{|Δ|+1} / s_|Δ|, s_|Δ| / s_{|Δ|+1}] [[Γ; Δ; τ′; τ ⊢ B : τ″]]
    [[Γ; Δ ⊢ iconst n; B : τ]]        = [n / s_|Δ|] [[Γ; Δ; int ⊢ B : τ]]
    [[Γ; Δ; int; int ⊢ iadd; B : τ]]  = [iadd(s_|Δ|, s_{|Δ|+1}) / s_|Δ|] [[Γ; Δ; int ⊢ B : τ]]
    [[Γ; Δ ⊢ iload i; B : τ]]         = [i / s_|Δ|] [[Γ; Δ; int ⊢ B : τ]]
    [[Γ; Δ; int ⊢ istore i; B : τ]]   = [s_|Δ| / i] [[Γ{i : int}; Δ ⊢ B : τ]]
    [[Γ; Δ; int ⊢ ifzero l; B : τ]]   = ifzero s_|Δ| then apply(l, Γ, Δ) else [[Γ; Δ ⊢ B : τ]]
    [[Γ; Δ ⊢ goto l : τ]]             = apply(l, Γ, Δ)

We denote by ⟨Γ; Δ ⊢ B : τ⟩ the lambda term λi_1. ⋯ λi_|Γ|. λs_0. ⋯ λs_{|Δ|-1}. [[Γ; Δ ⊢ B : τ]]. Let K be a JAL0 program unit such that ⊢ K : Λ. Let {l_1, …, l_n} = dom(Λ) and Λ(l_i) = Γ_i; Δ_i ⊢ τ_i for 1 ≤ i ≤ n. The translation (K:l)* of K:l under Λ is given as follows.

    (K:l)* = rec { l_1 = ⟨Γ_1; Δ_1 ⊢ K(l_1) : τ_1⟩; ⋯ ; l_n = ⟨Γ_n; Δ_n ⊢ K(l_n) : τ_n⟩ } in l

The above de-compilation algorithm is a proof translation, and as such it preserves typing. We first show that the translation of blocks preserves typing.

Lemma 1 Let Λ be a label context, and assume Λ ⊢ Γ; Δ ⊢ B : τ. Then Λ* ∪ (Γ; Δ)* ⊢ [[Γ; Δ ⊢ B : τ]] : τ holds in λrec.

This is proved by simple induction. Using this lemma, it is shown that the de-compilation of an entire program preserves its type, as stated in the following.

Theorem 2 If ⊢ K:l : Γ ⇒ τ and ⊢ K : Λ then the judgment ∅ ⊢ (K:l)* : (Γ; ∅ ⊢ τ)* is derivable in λrec.

Let us consider the example program unit K0. First, we translate each block of K0.

    [[{i0 : int}; ∅ ⊢ K0(l) : int]]     = ifzero i0 then l′ i0 else l″ i0 0
    [[{i0 : int}; ∅ ⊢ K0(l′) : int]]    = l″ i0 1
    [[{i0 : int}; int ⊢ K0(l″) : int]]  = s0

Thus the code with entry point K0:l is translated to

    (K0:l)* = rec { l  = λi0. ifzero i0 then l′ i0 else l″ i0 0;
                    l′ = λi0. l″ i0 1;
                    l″ = λi0. λs0. s0 } in l

where the translation is taken under the label context Λ0 such that ⊢ K0 : Λ0.
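The translation of this subsection can be phrased as a function that symbolically executes a block, keeping for every local variable and stack slot the λrec term currently denoting it; updating these terms plays the role of the substitutions [M/s_i] and [M/i] in the equations. The sketch below is our own illustration under that reading, reusing the JAL0 datatypes assumed earlier; dec_block builds the abstraction ⟨Γ; Δ ⊢ B : τ⟩ for one labeled block, and wrapping the results in a rec, as in (K0:l)* above, is left to the caller.

  (* apply(l, Gamma, Delta): the label applied to the terms for the locals (in index
     order) and then the stack slots from bottom to top. *)
  let apply l locals stack =
    String.concat " " ((l :: List.map snd locals) @ List.rev stack)

  (* locals: (index, term) list sorted by index; stack: term list, head on top. *)
  let rec dec locals stack b =
    match b, stack with
    | Ireturn, v :: _                 -> v
    | Goto l, _                       -> apply l locals stack
    | Seq (Iconst n, b'), _           -> dec locals (string_of_int n :: stack) b'
    | Seq (Iadd, b'), t2 :: t1 :: s   -> dec locals (Printf.sprintf "iadd(%s, %s)" t1 t2 :: s) b'
    | Seq (Pop, b'), _ :: s           -> dec locals s b'
    | Seq (Dup, b'), t :: s           -> dec locals (t :: t :: s) b'
    | Seq (Swap, b'), t2 :: t1 :: s   -> dec locals (t1 :: t2 :: s) b'
    | Seq (Iload i, b'), _            -> dec locals (List.assoc i locals :: stack) b'
    | Seq (Istore i, b'), t :: s      ->
        dec (List.sort compare ((i, t) :: List.remove_assoc i locals)) s b'
    | Seq (Ifzero l, b'), t :: s      ->
        Printf.sprintf "ifzero %s then %s else %s" t (apply l locals s) (dec locals s b')
    | _ -> failwith "ill-typed block"

  (* gamma_vars: the local indices in dom(Gamma), in increasing order; delta_len = |Delta|. *)
  let dec_block gamma_vars delta_len b =
    let svars  = List.init delta_len (fun j -> Printf.sprintf "s%d" j) in
    let locals = List.map (fun i -> (i, Printf.sprintf "i%d" i)) gamma_vars in
    let body   = dec locals (List.rev svars) b in
    List.fold_right (fun x m -> Printf.sprintf "fun %s -> %s" x m)
      (List.map snd locals @ svars) body

  (* For the unit K0 above, dec_block [0] 0 (List.assoc "l" (fst k0)) yields
     "fun i0 -> ifzero i0 then l' i0 else l'' i0 0". *)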

4 Bytecode with objects and methods

This section extends the framework for proof-directed de-compilation with classes, objects and methods. This requires us to develop a logical interpretation of classes and of the JVM instructions for object manipulation. We see two approaches to doing this. The first is to encode those structures using existing type-theoretical notions. A number of studies on type systems for object-oriented programming reveal that a simple class structure such as the one adopted in Java can be represented using records, function closures and recursive types. As we have shown in Section 2, the logical framework for code analysis can deal with records and function closures. By adding a mechanism for recursive types, it should be possible to extend our framework to classes and objects. The second approach is to consider classes and objects as given extra-logical structures, and to develop a logical framework parameterized by these structures. Here we take the latter approach for the following two reasons. First, object manipulation instructions are primitives, and the JVM specification only describes their expected behavior, leaving their implementation structures to the designer of a JVM interpreter. Although the recursive record model appears to be the most natural one and closely reflects existing implementations, it is outside the specification of JVM. The second reason is related to abstraction of objects. One advantage of an object-oriented language is that it provides the programmer with an abstract view of objects. In this respect, the recursive record model of objects is insufficient, and in order to provide a proper account of data abstraction, significantly more machinery would be required. Although such a development would be useful in understanding the essence of an object-oriented bytecode language, it is outside the scope of the present paper. For an analysis of the bytecode language itself, we believe that it is more appropriate to keep this structure abstract.

4.1 Extending types with classes and methods

Java is a class-based programming language supporting single inheritance, with additional features of abstract classes, method overloading, and interface classes with multiple super interfaces. Although JVM contains some instructions specific to these features, they are compile-time structures made available to the run-time system. In this work, we do not consider those advanced compile-time structures and concentrate on the basic mechanism for creating objects and invoking their methods.

There still remains one complication in this basic model, related to object initialization. As observed by Freund and Mitchell [3], a straightforward formulation of a type system for the Java bytecode language is unsound due to the possibility of accessing uninitialized objects. Their solution is to distinguish the types of initialized objects from those of uninitialized ones by indexing the type of an uninitialized object with the invocation of the corresponding object creation method. Although their type system is based on the one by Stata and Abadi, this mechanism is general and can be adopted in our logical framework. In the following, we develop a proof theory incorporating this mechanism.

We assume that there is a countably infinite set of (initialized) class identifiers, ranged over by c. We also assume that there is a countably infinite set of object indexes, ranged over by u, each of which indicates an invocation of the object creation method, and we assume that for each c there is a countably infinite set of uninitialized class identifiers indexed by object indexes. We write c^u for the uninitialized class identifier of index u associated with c. We let σ range over both initialized and uninitialized class identifiers. The syntax of types is extended with class identifiers as follows.

    σ ::= c | c^u
    τ ::= int | σ

An initialized class identifier is associated with field types and method types, both of which may contain class identifiers. We let f range over field names, and let m range over method names. A field type is a type τ of JAL excluding uninitialized class identifiers. A method type is a program type of the form Γ ⇒ τ as we have used before. To specify a class structure, we introduce a class context, ranged over by C, which associates each initialized class identifier c with a set of typed field names and a set of typed method names. We indicate the fact that c is associated with the fields f_i of type τ_i (1 ≤ i ≤ n) and methods m_j of type Γ_j ⇒ τ_j (1 ≤ j ≤ m) using the following notation.

    C(c)     = { f_1 : τ_1, …, f_n : τ_n | m_1 : Γ_1 ⇒ τ_1, …, m_m : Γ_m ⇒ τ_m }
    C(c).f_i = τ_i
    C(c).m_j = Γ_j ⇒ τ_j

A class context is generated at compile time and made available to the run-time system. In the following development, we assume that a fixed class context C is given.
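One concrete (and entirely hypothetical) representation of such a class context is sketched below; the record and constructor names are ours, and an uninitialized class identifier c^u is modeled as a class name paired with an object index.

  type cname = string
  type idx   = int

  type cid   = Cls of cname | ClsU of cname * idx    (* sigma ::= c | c^u *)
  type jtype = TInt | TObj of cid                    (* tau   ::= int | sigma *)

  type mtype = { margs : (int * jtype) list;         (* Gamma => tau; the receiver i0 is included *)
                 mret  : jtype }
  type cinfo = { fields  : (string * jtype) list;    (* field types exclude ClsU in practice *)
                 methods : (string * mtype) list }
  type cctx  = (cname * cinfo) list                  (* the class context C *)

  (* C(c).f and C(c).m as lookups *)
  let field_type  (c : cctx) cn f = List.assoc f (List.assoc cn c).fields
  let method_type (c : cctx) cn m = List.assoc m (List.assoc cn c).methods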

4.2 JAL: JAL0 with objects and classes

The set of instructions is extended as follows.

    B ::= ⋯ | areturn
    I ::= ⋯ | aload i | astore i | new c | init c | invoke c, m | getfield c, f | putfield c, f

An intuitive explanation of the additional instructions is the following. Areturn, aload and astore behave on object references analogously to the corresponding instructions on integers. New c creates an uninitialized object instance of class c and pushes its reference onto the stack. Init c pops an uninitialized object reference off the stack and initializes it. Practically, this instruction corresponds to invoking the initialization method of an uninitialized object; we omit its arguments for simplicity. Invoke c, m invokes a method m on an object of class c by popping the method's arguments and an object off the stack and transferring control to the code of method m of the object. The return value of the method is pushed onto the stack. Getfield c, f and putfield c, f read and write field f of instance objects of class c, respectively. Invoke, getfield and putfield instructions fail if they operate on uninitialized object instances.
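Continuing the earlier sketches, the extended instruction set can be represented as follows; the constructor names are again our own, and Core embeds the JAL0 instructions unchanged.

  type instr_obj =
    | Core of instr                      (* the JAL0 instructions above *)
    | Aload of int
    | Astore of int
    | New of cname
    | Init of cname
    | Invoke of cname * string           (* invoke c, m *)
    | Getfield of cname * string         (* getfield c, f *)
    | Putfield of cname * string         (* putfield c, f *)

  type block_obj =
    | IreturnO
    | AreturnO
    | GotoO of label
    | SeqO of instr_obj * block_obj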

The type system of JAL is defined by a proof system to derive the following forms of judgments.

    C; Λ ⊢ Γ; Δ ⊢ B : τ     block B has block type Γ; Δ ⊢ τ under Λ and C
    C ⊢ K : Λ               program unit K is typed with Λ under C
    C ⊢ K:l : Γ ⇒ τ         program K:l has program type Γ ⇒ τ under C

Apart from the fact that they are defined relative to a given class context C, the intended meaning of these relations can be understood similarly to the corresponding relations in JAL0. The rules for blocks not involving object manipulation instructions are the same as those of JAL0. The other rules are given in Figure 5. Note that in specifying the typing rules for blocks, a fixed C and Λ are implicitly assumed. In the rule for invoke, we assume that the set of variables {i_0, …, i_n} in the premise satisfies the relation i_0 < ⋯ < i_n, where < is the linear ordering we have assumed on local variables. In the rule for init, [c/c^u]Γ ([c/c^u]Δ) is the Γ′ (Δ′) obtained from Γ (Δ) by substituting c for each occurrence of c^u. The mechanism for type safe object initialization realized by the rules for new and init is an adaptation of that of [3].

Typing rules for blocks involving objects:

    Γ; Δ; c ⊢ areturn : c

    Γ; Δ; σ ⊢ B : τ
    -------------------------- (if Γ(i) = σ)
    Γ; Δ ⊢ aload i; B : τ

    Γ{i : σ}; Δ ⊢ B : τ
    ------------------------------
    Γ; Δ; σ ⊢ astore i; B : τ

    Γ; Δ; c^u ⊢ B : τ
    -------------------------- (c ∈ dom(C), u fresh)
    Γ; Δ ⊢ new c; B : τ

    [c/c^u]Γ; [c/c^u]Δ ⊢ B : τ
    --------------------------------
    Γ; Δ; c^u ⊢ init c; B : τ

    Γ; Δ; τ ⊢ B : τ′
    ----------------------------------------------- (if C(c).m = {i_0 : c, i_1 : τ_1, …, i_n : τ_n} ⇒ τ)
    Γ; Δ; c; τ_1; …; τ_n ⊢ invoke c, m; B : τ′

    Γ; Δ; C(c).f ⊢ B : τ
    -----------------------------------
    Γ; Δ; c ⊢ getfield c, f; B : τ

    Γ; Δ ⊢ B : τ
    --------------------------------------------
    Γ; Δ; c; C(c).f ⊢ putfield c, f; B : τ

Typing of program units:

    C; Λ ⊢ Γ; Δ ⊢ K(l) : τ  such that Λ(l) = Γ; Δ ⊢ τ,  for each l ∈ dom(Λ)
    --------------------------------------------------------------------------
    C ⊢ K : Λ

Typing of programs:

    C ⊢ K : Λ    Λ(l) = Γ; ∅ ⊢ τ for some τ
    -------------------------------------------
    C ⊢ K:l : Γ ⇒ τ

Figure 5: The type system for JAL

The soundness of the type system must be shown with respect to an actual model of classes and object structures. As we suggested earlier, such a model can be given using records, function closures and recursive types. This is outside the scope of the present paper, and we leave it to future work.

4.3 De-compilation algorithm

The target of de-compilation is a lambda calculus extended with a mechanism for representing objects. In addition to class structures, some imperative features are inevitable. In de-compiling JAL0, we successfully interpreted its imperative features in a proof-theoretical framework using conventional logical inference rules, and we could de-compile it to a functional language (with a fixed point construct to represent possibly cyclic label references). With objects, this is no longer possible, since mutable objects are globally shared by other objects. For this reason, a language for the target of de-compilation must include imperative features. We continue to assume that the class structure represented by a class context C is given and fixed. Our target language for de-compilation is defined relative to C. The set of types and terms of the target language is given below.

    σ ::= c | c^u
    τ ::= σ | int | τ → τ
    L ::= n | iadd(L, L) | x | λx.M | L L

    M ::= L | rec {x = L; ⋯ ; x = L} in M | let x = M in M | ifzero M then M else M
        | new c | x.init c; M | x.m(L, …, L) | x.f | x.f := M; M

The last five expressions are those for object creation, object initialization, method invocation, object field extraction and object field update. Since an object field update is a command and does not produce any value, the corresponding expression x.f := M1 is followed by another expression M2, forming a compound expression x.f := M1; M2. The type system of this language is given in Figure 6. We should note that this type system does not take uninitialized object types into account. Because of the higher-order features, Freund and Mitchell's technique does not easily extend to the lambda calculus. In this type system, uninitialized object types of the form c^u are identified with c. As a result, we can only show type preservation up to this identification. We will include a discussion of this subtle point in the final version.

    Γ ⊢ n : int          Γ{x : τ} ⊢ x : τ

    Γ{x : τ_1} ⊢ M_1 : τ_2
    ----------------------------
    Γ ⊢ λx.M_1 : τ_1 → τ_2

    Γ ⊢ L_1 : τ_1 → τ_2    Γ ⊢ L_2 : τ_1
    ---------------------------------------
    Γ ⊢ L_1 L_2 : τ_2

    Γ ⊢ M_1 : τ_1    Γ{x : τ_1} ⊢ M_2 : τ_2
    -------------------------------------------
    Γ ⊢ let x = M_1 in M_2 : τ_2

    c ∈ dom(C)
    -----------------
    Γ ⊢ new c : c

    Γ ⊢ M : τ    Γ(x) = c
    ----------------------------
    Γ ⊢ x.init c; M : τ

    Γ ⊢ x : c    C(c).m = {i_0 : c, i_1 : τ_1, …, i_n : τ_n} ⇒ τ    Γ ⊢ L_i : τ_i (1 ≤ i ≤ n)
    ---------------------------------------------------------------------------------------------
    Γ ⊢ x.m(L_1, …, L_n) : τ

    Γ ⊢ x : c    C(c).f = τ
    ------------------------------
    Γ ⊢ x.f : τ

    Γ ⊢ x : c    C(c).f = τ_1    Γ ⊢ M_1 : τ_1    Γ ⊢ M_2 : τ_2
    -----------------------------------------------------------------
    Γ ⊢ x.f := M_1; M_2 : τ_2

    Γ{x_1 : τ_1, …, x_n : τ_n} ⊢ L_i : τ_i (1 ≤ i ≤ n)    Γ{x_1 : τ_1, …, x_n : τ_n} ⊢ M : τ
    ---------------------------------------------------------------------------------------------
    Γ ⊢ rec {x_1 = L_1; ⋯ ; x_n = L_n} in M : τ

Figure 6: Type System of the Lambda Calculus with Objects

The de-compilation algorithm is obtained by extending the one for JAL0 presented in the previous section with equations for the object manipulation instructions. Figure 7 shows the additional equations required for this extension.

    [[Γ; Δ; c ⊢ areturn : c]]                       = s_|Δ|
    [[Γ; Δ ⊢ aload i; B : τ]]                       = [i / s_|Δ|] [[Γ; Δ; Γ(i) ⊢ B : τ]]
    [[Γ; Δ; σ ⊢ astore i; B : τ]]                   = [s_|Δ| / i] [[Γ{i : σ}; Δ ⊢ B : τ]]
    [[Γ; Δ ⊢ new c; B : τ]]                         = let s_|Δ| = new c in [[Γ; Δ; c^u ⊢ B : τ]]
    [[Γ; Δ; c^u ⊢ init c; B : τ]]                   = s_|Δ|.init c; [[[c/c^u]Γ; [c/c^u]Δ ⊢ B : τ]]
    [[Γ; Δ; c; τ_1; …; τ_n ⊢ invoke c, m; B : τ′]]  = let s_|Δ| = s_|Δ|.m(s_{|Δ|+1}, …, s_{|Δ|+n}) in [[Γ; Δ; τ ⊢ B : τ′]]
                                                      where C(c).m = {i_0 : c, i_1 : τ_1, …, i_n : τ_n} ⇒ τ
    [[Γ; Δ; c ⊢ getfield c, f; B : τ]]              = let s_|Δ| = s_|Δ|.f in [[Γ; Δ; C(c).f ⊢ B : τ]]
    [[Γ; Δ; c; C(c).f ⊢ putfield c, f; B : τ]]      = s_|Δ|.f := s_{|Δ|+1}; [[Γ; Δ ⊢ B : τ]]

Figure 7: De-compilation algorithm for object primitives in JAL
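The let-insertion performed by these equations can be sketched as an extension of the dec function of Section 3.4, as below. This is an illustration under the assumptions of the earlier sketches (apply and dec from Section 3.4, fresh, take and drop from Section 2, and the class context representation above); the results of new, invoke and getfield are bound with let rather than substituted, since they are, or may be, effectful.

  let rec dec_obj cx locals stack b =
    match b, stack with
    | IreturnO, v :: _ | AreturnO, v :: _ -> v
    | GotoO l, _                          -> apply l locals stack
    | SeqO (Aload i, b'), _               -> dec_obj cx locals (List.assoc i locals :: stack) b'
    | SeqO (Astore i, b'), t :: s         ->
        dec_obj cx (List.sort compare ((i, t) :: List.remove_assoc i locals)) s b'
    | SeqO (New c, b'), _ ->
        let x = fresh () in
        Printf.sprintf "let %s = new %s in %s" x c (dec_obj cx locals (x :: stack) b')
    | SeqO (Init c, b'), t :: s ->
        Printf.sprintf "%s.init %s; %s" t c (dec_obj cx locals s b')
    | SeqO (Invoke (c, m), b'), _ ->
        (* C(c).m includes the receiver i0, so n stack arguments sit above the receiver *)
        let n    = List.length (method_type cx c m).margs - 1 in
        let args = List.rev (take n stack) in
        let recv = List.nth stack n in
        let x    = fresh () in
        Printf.sprintf "let %s = %s.%s(%s) in %s" x recv m (String.concat ", " args)
          (dec_obj cx locals (x :: drop (n + 1) stack) b')
    | SeqO (Getfield (_, f), b'), t :: s ->
        let x = fresh () in
        Printf.sprintf "let %s = %s.%s in %s" x t f (dec_obj cx locals (x :: s) b')
    | SeqO (Putfield (_, f), b'), tval :: tobj :: s ->
        Printf.sprintf "%s.%s := %s; %s" tobj f tval (dec_obj cx locals s b')
    | _ -> failwith "integer instructions are handled as in dec (Section 3.4)"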

As in the case of the de-compilation algorithm for JAL0, this algorithm is a proof transformation from a sequential sequent calculus to (an extension of) the natural deduction, which implies the preservation of typing. The following theorem formally establishes this property.

Theorem 3 If C; Λ ⊢ Γ; Δ ⊢ B : τ then, under the class context C, the judgment Λ* ∪ (Γ; Δ)* ⊢ [[Γ; Δ ⊢ B : τ]] : τ

is derivable in the lambda calculus with objects.

This result is to be taken up to the identification of uninitialized object types of the form c^u with the object type c mentioned earlier.

5 A prototype implementation of a de-compiler

We have implemented a prototype de-compiler, JD, based on the translation algorithm presented in this paper. JD supports more instructions and types than those we have considered in the formal framework, including arithmetic, arrays, double-word types and so on. The input to JD is a JVM assembly language source file in the format of Jasmin described in [8], which can be mechanically constructed from a JVM class file. JD first parses a given source file to obtain an internal representation of a set of sequences of JVM instructions, each of which corresponds to one method. JD then converts each sequence of instructions into a program unit consisting of blocks. In doing this, JD considers a cascaded block as a collection of blocks connected by implicit jumps, and inserts jumps to make the block structure explicit. This insertion does not change the semantics of the program. Finally, JD de-compiles each program unit by applying the algorithm presented in this paper to generate a term in the lambda calculus with objects, which is pretty-printed to the output stream. As we mentioned in the Introduction, the example we showed in Figure 1 is actual output generated by JD.

In addition to dealing with more instructions and types, JD also performs some extra work beyond the proof-directed de-compilation we presented in the previous sections. One is optimization of the resulting lambda term by removing intermediate labels and temporary variables. Since most JVM bytecode programs consist of many small blocks, without this processing the resulting lambda term becomes difficult to understand, due to the many labels and variables that are not essential to the control flow of the program. JD achieves this by analyzing redundant labels and variables, and performs an operation corresponding to β-reduction only if the term being substituted does not involve any imperative features. In Figure 1, the block corresponding to label L5 is eliminated by this optimization process. JD also properly deals with double-word types such as long and double, which we have omitted in our formal treatment for simplicity of presentation. Values of these types occupy two contiguous words in the stack, and therefore a type of those values in the stack context does not correspond to a single word in the stack. This means that JVM stack manipulation instructions may no longer correspond to logical inference rules. As an example, consider the dup2 instruction, which duplicates the two words on the top of the stack. The logical effect of this instruction depends on whether the contents of the stack is one double-word value or two single-word values. JD handles this correctly.
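The conversion of a cascaded instruction stream into explicitly connected blocks can be sketched as follows. This is not JD itself: the raw instruction form and the treatment of dead code after an unconditional transfer are simplifying assumptions; the point is only that a block that falls through into the next label receives an explicit goto to it, which preserves the semantics.

  type raw =
    | Label of label
    | Plain of instr          (* a non-branching JAL0 instruction *)
    | RGoto of label
    | RIfzero of label
    | RIreturn

  let split_blocks (entry : label) (prog : raw list) : (label * block) list =
    (* wrap the accumulated instructions (kept in reverse order) around a terminator *)
    let wrap rev_is terminator = List.fold_left (fun b i -> Seq (i, b)) terminator rev_is in
    let rec go cur rev_is rest =
      match rest with
      | []                 -> failwith "the instruction stream falls off the end"
      | Label l :: rest'   -> (cur, wrap rev_is (Goto l)) :: go l [] rest'   (* explicit fall-through *)
      | Plain i :: rest'   -> go cur (i :: rev_is) rest'
      | RIfzero l :: rest' -> go cur (Ifzero l :: rev_is) rest'
      | RGoto l :: rest'   -> (cur, wrap rev_is (Goto l)) :: skip rest'
      | RIreturn :: rest'  -> (cur, wrap rev_is Ireturn) :: skip rest'
    and skip rest =                          (* unreachable code until the next label *)
      match rest with
      | []               -> []
      | Label l :: rest' -> go l [] rest'
      | _ :: rest'       -> skip rest'
    in
    go entry [] prog

  (* On the fact method of Figure 1, for instance, the stream would be cut at L5 and L12,
     and the initial segment would end in an explicit goto L12 rather than falling through. *)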

6 Conclusions and further investigations

We have developed a framework for proof-directed de-compilation of Java bytecode based on a Curry-Howard isomorphism for machine code. We first defined a proof system for the Java bytecode language including integer primitives, stack operations, local variable manipulation, and conditional and unconditional jumps. In this proof system, each machine instruction is represented as a rule similar to a left rule in the sequent calculus, and a jump instruction is interpreted as a rule that refers to an existing proof. We then gave an algorithm to transform proofs of this proof system into the lambda calculus with mutually recursive function definitions. The resulting lambda term is not a trivial encoding of a sequence of machine instructions, but reflects the behavior of the given bytecode program. This process therefore serves as proof-directed de-compilation of Java bytecode to a high-level language. We have extended this framework with classes, objects, and methods. A prototype de-compiler for Java bytecode has been implemented, which demonstrates the feasibility of the proof-directed de-compilation approach presented in this paper. We believe that the results presented here will open up new possibilities in network computing and other computation paradigms where executable code is dynamically exchanged and used.


References

[1] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4):451-490, 1991.

[2] C. Flanagan, A. Sabry, B. F. Duba, and M. Felleisen. The essence of compiling with continuations. In Proc. ACM PLDI Conference, pages 237-247, 1993.

[3] S. Freund and J. Mitchell. A type system for object initialization in the Java bytecode language. In Proceedings of ACM Conference on Object-Oriented Programming Systems, Languages and Applications, pages 310-328, 1998.

[4] J. Gallier. Constructive logics part I: A tutorial on proof systems and typed λ-calculi. Theoretical Computer Science, 110:249-339, 1993.

[5] G. Gentzen. Investigations into logical deduction. In M. E. Szabo, editor, The Collected Papers of Gerhard Gentzen. North-Holland, 1969.

[6] J. Gosling, B. Joy, and G. Steele. The Java Language Specification. Addison-Wesley, 1996.

[7] T. Lindholm and F. Yellin. The Java Virtual Machine Specification. Addison-Wesley, second edition, 1999.

[8] J. Meyer and T. Downing. Java Virtual Machine. O'Reilly, 1997.

[9] G. Morrisett, K. Crary, N. Glew, and D. Walker. Stack-based typed assembly language. In Proc. International Workshop on Types in Compilation, Springer LNCS 1478, 1998.

[10] G. Morrisett, D. Walker, K. Crary, and N. Glew. From System F to typed assembly language. In Proc. ACM Symposium on Principles of Programming Languages, 1998.

[11] G. Necula and P. Lee. Proof-carrying code. In Proc. ACM Symposium on Principles of Programming Languages, pages 106-119, 1998.

[12] R. O'Callahan. A simple, comprehensive type system for Java bytecode subroutines. In Proceedings of ACM Symposium on Principles of Programming Languages, pages 70-78, 1999.

[13] A. Ohori. A Curry-Howard isomorphism for compilation and program execution. In Proc. Typed Lambda Calculi and Applications, Springer LNCS 1581, pages 258-179, 1999.

[14] A. Ohori. The logical abstract machine: a Curry-Howard isomorphism for machine code. In Proceedings of International Symposium on Functional and Logic Programming, 1999.

[15] G. Pottinger. Normalization as a homomorphic image of cut-elimination. Ann. Math. Logic, 12:323-357, 1977.

[16] D. Prawitz. Natural Deduction. Almqvist & Wiksell, 1965.

[17] R. Stata and M. Abadi. A type system for Java bytecode subroutines. In Proc. ACM Symposium on Principles of Programming Languages, pages 149-160, 1998.

[18] Proceedings of the Working Conference on Reverse Engineering (WCRE). IEEE Computer Society Press, 1993-.

[19] J. Zucker. The correspondence between cut-elimination and normalization. Ann. Math. Logic, 7:1-112, 1974.

