Nat.Lab. Unclassified Report 822/98
Date of issue: 09/98

Compact Code Generation through Custom Instruction Sets
Unclassified version of TN 417/96

Maarten Wegdam. Revised by Rik van de Wiel.

Authors' address data: Maarten Wegdam; Rik van de Wiel, WL11; [email protected]

© Philips Electronics N.V. 1998
All rights are reserved. Reproduction in whole or in part is prohibited without the written consent of the copyright owner.

Unclassified Report: 822/98

Title: Compact Code Generation through Custom Instruction Sets. Unclassified version of TN 417/96.

Author(s): Maarten Wegdam. Revised by Rik van de Wiel.

Keywords: compilers; embedded systems; code compression; instruction sets; RISC; MIPS

Abstract: This report describes the use of a custom instruction set to optimize for code size. A compiler is described that accepts a C program as input and generates a custom instruction set for that program, such that compiling with this instruction set minimizes the resulting code size. The method uses a greedy heuristic that tries different instruction sets. The chosen instruction set defines a virtual machine. The input program is translated for this virtual machine, and an interpreter is generated that implements this virtual machine. Experiments show that on a RISC processor (MIPS) the resulting code size is typically 1/3 of the original code size. The compressed code runs considerably slower than the original code.

Conclusions: In our opinion achieving significantly smaller code sizes is only possible if the front end of the compiler is modified to produce more uniform IR expression trees. Other conclusions: The compression is obtained both for applications used in embedded systems and for 'general' applications. It is possible to mix compressed code with normally compiled code. An instruction set generated for one application also reduces the code size of other applications; several of the instruction sets reduce the code size of all test programs to at least 51% of the original code size. Instead of implementing the Virtual Machine with a (software) interpreter, this could be done in hardware (microcode); the run time of the compressed code would then be equal to that of the normally compiled code.
Contents

Preface  vii

1 Introduction  1
  1.1 Problem description  1
  1.2 Related work  2
  1.3 Approach  4
  1.4 Overview  5

2 Basic compiler  6
  2.1 Overview  6
  2.2 The base Virtual Machine  7
  2.3 The compiler  10
  2.4 Results  17

3 Enhancements  21
  3.1 Doing it faster  22
  3.2 Doing it better  26
  3.3 General space-saving optimizations  32

4 Conclusions and Future work  35
  4.1 Discussion  35
  4.2 Conclusions  40
  4.3 Future work  41

Bibliography  43

A The base operators  45

B Raw data  47

Glossary  60
Preface

This is my master's thesis and the final part of my study of Computing Science (in Dutch: Informatica) at the University of Groningen. The research was done for Philips Research at the Philips Natuurkundig Laboratorium (Natlab) in Eindhoven, The Netherlands. I worked in the Compiler Technology cluster (Schepers), part of the Information Technology group (Treffers) of the sector Information and Software Technology. My supervisors were Eelco Dijkstra (both Philips Natlab and University of Groningen) and Joachim Trescher (Philips Natlab).
Acknowledgments

I would like to thank my supervisors Eelco Dijkstra and Joachim Trescher. Their input was indispensable, both for the research itself and for this report. I want to thank all the members of the Compiler Technology cluster for the pleasant working environment they provided. I also want to thank Prof. C. Bron, Prof. W.H. Hesselink and Drs. J.H. Jongejan for their involvement as members of my graduation committee.

Maarten Wegdam
Note

This report is an unclassified version of Nat.Lab. technical note TN 417/96.

Rik van de Wiel
1 Introduction

The purpose of this research is to construct a compiler that generates executables which have at most half the code size of executables generated by existing compilers. The execution time of the compressed executable is allowed to increase by a factor of 10-20. This research is limited to exploring the possibilities of achieving this goal using custom instruction sets. The research is focused on the application of this technique in embedded systems, in particular for the MIPS processors that Philips manufactures for use in embedded systems.
1.1 Problem description

When generating object code, two efficiency aspects of the generated code are considered: the execution time and the size of the executable. Sometimes energy consumption is considered as a third efficiency aspect. (This is relevant for portable embedded systems, which have limited energy storage.) For most programs the execution time is by far the most important aspect. Modern compilers therefore often try to save time, even at the expense of space (e.g. by inlining). This is not desirable for all applications. For embedded systems in printers, game controllers, appliances and the like, a small, fixed-size memory is often the limiting design constraint.

In [LDK+96] two trends are mentioned in the design of embedded systems. First, considerations of cost, power, and reliability are forcing designers to incorporate all the electronics (microprocessor, program ROM, RAM, and application-specific circuit components) into a single integrated circuit. Second, the amount of software incorporated into embedded systems is growing larger and more complex. The first trend elevates code density to a new level of importance because program code resides in on-chip ROM, the size of which translates directly into silicon area and cost. Moreover, designers often devote a significant amount of time to reducing code size so that the code will fit into the available ROM, because exceeding the on-chip ROM size could require an expensive redesign of the entire IC and even of the whole system. The second trend, increasing software and system complexity, mandates the use of high-level languages in order to decrease development costs and time-to-market. Since current compilers leave much room for improvement, especially when it comes to code size, programming in a high-level language can incur penalties in code size (and in performance, for that matter).

It is generally recognized that RISC processors provide higher performance without requiring extra chip area compared to CISC processors. This higher performance does come at the price of larger code size. This is one of the major reasons why the cost-conscious embedded systems market has been reluctant to use RISC processors, like the MIPS.
Code size can be the critical aspect outside embedded systems as well:

- There exist languages for transmitting Web clients over the Internet (Java). For such programs, interpretive overhead can be less important than program size.
- Optimizing linkers and loaders that generate code on the fly [Fra94] incur I/O and code-generation costs that rise with the number of operators that they read. Compressing the input improves performance.
1.2 Related work

Since compilers are usually made to optimize for minimal execution time, far more research has been done in that area than in optimizing for minimal space. This doesn't mean that no research has been done on code compression.

Franz describes in [Fra94] a technique for representing programs abstractly and independently of the eventual target architecture. This representation is twice as compact as machine code for a CISC processor. Code generation is deferred until load time, at which point native code is created on the fly by a code-generating loader. The process of loading with dynamic code generation is so fast that it requires little more time than the input of equivalent native code from a disk storage medium. The representation, called semantic-dictionary encoding (SDE), encodes a source program as a sequence of indices into a semantic dictionary. This semantic dictionary contains the information necessary to generate native code. The dictionary is constructed during the translation of a source program to SDE form, and reconstructed before (or during) the decoding process. This method bears some resemblance to commonly used data compression schemes ([Wel84]).

Advanced RISC Machines Limited (UK) has extended their architecture to reduce code size by developing a new instruction set, named Thumb ([Lim95] and [Tur95]). Thumb contains 16-bit wide opcodes which are a subset of the standard 32-bit ARM instruction set with a compact encoding. On execution these new 16-bit Thumb opcodes are decompressed by an instruction predecoder to their ARM instruction set equivalents. Because the core can execute both the standard ARM instruction set and the Thumb instruction set, the programmer can choose per subroutine between code size and speed. The new 16-bit instruction set is not complete; for example, exception handling can't be performed with Thumb and some instructions can't be represented in Thumb. In those cases one has to rely on the original 32-bit instruction set instead. The resulting code density improvement is 25-35% compared to ARM code.

[FMW84] claims that the two most important optimizations that save space are procedural abstraction and cross-jumping. Procedural abstraction turns repeated code fragments into procedures, either at Intermediate Representation level or at source level. Cross-jumping reuses the common tail of two merging code sequences. A generalization of these two methods was tried on assembly code.
The resulting compression ratios varied between 0-39%, with an average of 7%. The compressed code took 1-5% more CPU time, but (on non-embedded systems) as much as 11% less real time, presumably because it loads faster.

[KW94] and the earlier [WC94] describe a method whereby embedded programs are stored in compressed form but executed from the instruction cache in the standard format. The system consists of a standard RISC processor core augmented with a special code-expanding instruction cache. This provides increased storage density with only a minimal impact on processor performance. Experiments show that a practical increase of 15-30% and a theoretical increase of over 100% in code density can be expected using this technique. In this article the obvious, but still true, observation is made that since embedded systems are highly cost sensitive and typically only execute a single program, it is not possible to include temporary storage for the uncompressed version of a program.

In [LDK+96] a data-layout algorithm is described that decreases code size. The technique takes advantage of special architectural features of embedded processors. The storage allocation of variables is moved from the front end to the code generation step that selects addressing modes, thereby increasing the opportunities to use efficient autoincrement/autodecrement modes. This technique reduced the number of instructions by an average of about 9% when using multiple address registers.

An obvious way to achieve a compact program representation is to use an instruction set that has been designed for this purpose. One step further in this direction is to construct a custom instruction set specially for a certain program. The chosen instruction set defines a Virtual Machine (VM), which can be implemented by an interpreter. This technique is described in [Pro95] and the unpublished [FP95]. The idea is to automatically construct the VM in such a manner that the size of the (interpretive) code for a certain program is minimal. The second article describes an improvement of the first, so I will only elaborate on the second. The algorithm works as follows:

1. The source program is analysed at Intermediate Representation (IR) level to determine which possible combinations of instructions are potential members of the instruction set.
2. A code generator generator is used to construct a code generator for the above potential instructions. The code generator can use any subset of the set of potential instructions.
3. The algorithm evaluates different instruction sets to see which one saves the most space. It starts with the necessary primitive instructions and repeatedly adds to this instruction set the potential instruction that saves the most space, given those already selected. The algorithm determines the amount of space that would be saved by actually encoding the program and measuring the resulting code size. The process of adding potential instructions to the instruction set ends when there are no more unused opcodes.
4. An interpreter and the interpretive code for the selected instruction set are generated.

Step 3 uses a greedy heuristic. Calculating the absolute minimal code size by trying all possible subsets of the instructions generated in step 1 would be infeasible. The IR operators of the retargetable C compiler lcc served as a basic instruction set (with a few changes). The custom instruction set algorithm about halves the original code size. There is a time penalty: the interpreter runs about 20 times slower than the original code.
1.3 Approach

Beforehand, some choices were made about the way this research would be conducted. These choices are listed here.
MIPS
As stated before, this technique might be used for compressing programs in embedded systems. This means that the language to compress is C. Philips intends to market MIPS processors for embedded systems, which means that the target machine will be a MIPS R3000. This has consequences for the way the compression ratio is measured.
Proebsting/Fraser

Proebsting and Fraser used custom instruction sets and obtained promising results in [FP95] (and [Pro95]). Therefore these articles will act as a starting point.

lcc

The (front end of the) retargetable C compiler lcc, version 3.5 ([FH91b]), will be used. lcc is a well-documented, relatively small C compiler which has a, again relatively, strict separation of front end and back end. The Intermediate Representation of lcc is in the form of trees, where each tree matches about one C statement.

lburg

The code generator generator lburg (a variant of iburg [FHP92a]) will be used. lburg reads a machine description and writes a code generator which uses tree matching and dynamic programming to compute a least-cost cover for a given input tree. The code generator is used as back end for lcc and takes the Intermediate Representation trees as input. It emits an assembly file.
Bytecodes

The instructions will be bytecoded instructions. A bytecoded instruction consists of an opcode of one byte and a sequence of immediate operands. Having variable-length opcodes would make implementing (and partly generating) both the interpreter and the assembler more difficult. Opcodes with a size of one byte give us 256 possible instructions.
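To make the encoding concrete, the sketch below lays out two such instructions in a byte stream. This is an illustration only: the opcode values are invented for the example and are not taken from the report's instruction set.

#include <stdio.h>
#include <string.h>

/* Hypothetical opcode values, chosen just for this example. */
enum { OP_CNSTI = 0x01, OP_ADDI = 0x02 };

int main(void) {
    unsigned char code[16];
    int n = 0, imm = 1;

    code[n++] = OP_CNSTI;               /* 1-byte opcode: CoNSTInteger[#]  */
    memcpy(&code[n], &imm, sizeof imm); /* 4-byte immediate, byte-aligned  */
    n += sizeof imm;
    code[n++] = OP_ADDI;                /* ADDInteger: operands come from
                                           the stack, so no extra bytes    */

    printf("encoded %d bytes\n", n);    /* 5 + 1 = 6 bytes */
    return 0;
}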
Stack-based Virtual Machine
The basic Virtual Machine is stack-based. A stack-based Virtual Machine can have a base instruction set that is (almost) equivalent to the Intermediate Representation operators of lcc. Making the machine description of the Virtual Machine is then quite straightforward. Not having to bother with register allocation makes compiling for a stack-based Virtual Machine a lot easier. A second benefit of a stack-based machine is that the number of operands of an instruction doesn't need to have an upper bound, which a register-based machine has (typically two or three). The third benefit of a stack-based machine is that mapping an instruction to an opcode is easier: there is only one opcode format, and the arguments are simply in the following bytes instead of in a field of the instruction itself. It is possible, however, that a register-based Virtual Machine is faster or that the interpretive code for such a Virtual Machine is smaller. It is not possible to really say anything about this without implementing both and comparing them.
1.4 Overview

Chapter 2 describes a compiler that uses the custom instruction set technique. Only the basic technique is implemented, without any improvements. The compression results for a set of benchmark programs are given. In Chapter 3 several enhancements to the compiler described in Chapter 2 are presented. The effect of these enhancements on the compression of the benchmark programs is given. Chapter 4 discusses the results that have been obtained, draws conclusions and gives some suggestions for further research.
2 Basic compiler

This section describes a compiler which is based on the Fraser/Proebsting algorithm. Only those techniques that are described in [FP95] were implemented. We will refer to the version of the compiler described in this section as the basic compiler. We give a general overview of the compiler in Section 2.1. Section 2.2 contains a brief survey of the base Virtual (stack) Machine that serves as the target machine for the compiler. In Section 2.3 we elaborate on the different stages of the compiler. Finally, in Section 2.4 we document the performance figures achieved with the basic compiler.
2.1 Overview

Our intention is to compute, for a given application, an instruction set that allows us to represent the application as compactly as possible in an executable form. Towards this end, we translate this application into an intermediate representation, so-called expression trees. Expression trees consist of nodes that are basic operations of a virtual stack machine. Each expression tree represents a computation that corresponds to the execution of the sequence of basic operations obtained when visiting the tree nodes in postfix order. A dedicated program examines these expression trees to identify frequently occurring coherent tree fragments. Since each coherent tree fragment represents a particular computation, it is possible to assign an opcode to these frequently occurring tree fragments.

Figure 1 illustrates the internal structure of the basic compiler. The rectangles represent the different stages of the compiler; an oval constitutes an input or output of a compiler stage. Starting with the C source for a certain program, the interpretive code and the interpreter are developed.

I. A compiler (lcc) reads the C source for the program to be compressed. It writes an ASCII image of the Intermediate Representation (IR) expression trees of the program to be compressed.

II. A program (candgen) reads the subject trees and enumerates their coherent tree fragments. Each tree fragment represents a sequence of operations and corresponds to a potential member (candidate) of the instruction set under construction. When a tree fragment includes a constant, a variant of this tree fragment is generated to infer instructions with immediate operands as well as instructions with hardwired constants, e.g. an instruction that assigns zero to a variable.

III. A variant of the code generator generator iburg [FHP92a] reads the tree fragments produced above and generates a code generator, which in turn uses tree matching and dynamic programming to compute a least-cost cover for a given input tree. The code generator can use any subset of the tree patterns generated in step II.
[Figure 1: Overview of the compiler. The lcc front end (I) turns the C program to compress into trees as ASCII; the candidates generator (II) produces tree patterns; lburg (III) turns these into a code generator; the instruction set selector (IV) produces the generated instruction set, which feeds (V) both (a) a compiler emitting the interpretive code and (b) an interpreter generator emitting the interpreter.]

IV. The code generator of the previous step is used to evaluate different instruction sets to determine which instruction set allows the most compact program representation. This is done in several steps: starting with the necessary base operators, the tree pattern that saves the most space is repeatedly added to the instruction set. This stops when all available opcodes have been used. This instruction set defines a Virtual Machine (VM) on which the program to be compressed can be executed.

V. (a) The instruction set derived in step IV is used to translate the C source for the program to be compressed into interpretive code. (b) A final program accepts the same instruction set and automatically generates a conventional stack-based interpreter that implements the VM.
2.2 The base Virtual Machine

The Virtual Machine is a pure stack-based machine. The set of base instructions is almost equal to the set of operators in the IR of lcc. An instruction has zero or more operands. An operand can be an implicit operand or an immediate operand. Implicit operands are located on the stack. Immediate operands are located in the code directly after the opcode of the instruction.
Example 2.1 A few examples of base instructions and their meaning are:

CoNSTInteger[#]         Push the immediate operand, which is an integer, on the stack.
ADDInteger(*,*)         Pop two integers from the stack, and push their sum on the stack.
ADDRessLocalPointer[#]  Push the address of a local variable on the stack. The immediate operand is the offset from the frame pointer.
INDIRectInteger(*)      Pop the address of an integer variable from the stack and push its value on the stack.

A '*' stands for an implicit operand, which is to be popped from the stack. A '#' represents an immediate operand, residing in the code segment in the bytes following the opcode. lcc's compact opcode names have been expanded to make them more readable. The original characters of the opcode name are given in capitals. This will be done throughout this thesis. A complete survey of the instructions can be found in Appendix A.

The only changes that were made are those needed to be able to interpret the IR. Two instructions are added to maintain the stack: ADDStackPointer[#] and POP. ADDStackPointer has one immediate operand, which is an integer; it adds this integer to the stack pointer (sp). POP has no operands; it just pops the first element of the stack. The POP instruction is only needed because of the way the interpreter is implemented.

All opcodes have a size of exactly one byte, which means that the interpreter will be a so-called bytecode interpreter. The alignment of the stack machine has to be equal to the alignment of the host machine to be able to pass variables to non-interpreted functions, e.g. library calls. The only exception is the code segment: both opcodes and immediate operands are byte-aligned. Hence no bytes are lost because of alignment restrictions in the code segment. This combination of alignments makes it possible for the code size to be as small as possible and at the same time to be able to call normally compiled functions, because all variables are correctly aligned for the host machine.

Both local variables and incoming arguments are addressed by an offset from the frame pointer (fp). See Figure 2 for the layout of a stackframe. The return instructions perform not only the jump back to the caller, they also restore the stack pointer and the frame pointer. This removes the need for an extra instruction to manipulate the frame pointer, and thus saves an opcode for a superoperator. This also means that only one instruction needs to be emitted instead of three, making the resulting code smaller (and faster). Emitting the nodes of the expression trees in postfix order results in executable code for the VM that implements the original C statement.
[Figure 2: A stackframe. From top to bottom: incoming arguments, return address, old fp (where fp points), and local variables (down to sp); the stack grows downward.]
Example 2.2 As an example of how C statements are translated into IR trees, look at the statements on lines 5 and 6 of the function in this C fragment.

1 int f(int i)
2 {
3     int j;
4
5     j = i + 1;
6     j = j + i + 1;
7
8     return j;
9 }
The translation of these two statements yields the IR trees displayed in Figure 3, in which the nodes are numbered in postfix order. By way of illustration the nodes are also connected in postfix order. The assembly versions of the trees would thus look like this:

(a) 1 ADDRessLocalPointer[-4]
    2 CoNSTantInteger[1]
    3 ADDRessFormalPointer[8]
    4 INDIRectInteger
    5 ADDInteger
    6 ASsiGNInteger

(b) 1 ADDRessLocalPointer[-4]
    2 CoNSTantInteger[1]
    3 ADDRessLocalPointer[-4]
    4 INDIRectInteger
    5 ADDInteger
    6 ADDRessFormalPointer[8]
    7 INDIRectInteger
    8 ADDInteger
    9 ASsiGNInteger
[Figure 3: IR trees for the statements (a) j = i + 1; and (b) j = j + i + 1;, with the nodes numbered and connected in postfix order.]
2.3 The compiler

In this subsection the different steps of the compiler are explained in more detail.
Generation of all possible instructions

The front end of lcc emits the ASCII image of the IR expression trees. These trees are read by a program named candgen (which stands for candidates generator). This program enumerates tree fragments which are part of one of the IR trees. An example of a tree fragment is an IR tree with one of the children omitted; this missing child is then replaced by a so-called wildcard. The wildcard has a similar role as the * in a UNIX shell command, e.g. ls *.ps.
Example 2.3 Figure 4 contains a few examples of generated patterns, each with a short description.

[Figure 4: Examples of generated patterns: "a constant integer one" (CoNSTantInteger[1]); "the value of an integer local variable" (INDIRectInteger(ADDRessLocalPointer[4])); "the sum of the integer operand and a function argument" (ADDInteger(CoNSTantInteger[#], INDIRectInteger(ADDRessFormalPointer[-8]))); "the sum of one and an integer on the stack" (ADDInteger(CoNSTantInteger[1], *)); "the sum of the operand and an integer on the stack" (ADDInteger(CoNSTantInteger[#], *)).]
The pattern generation algorithm for a tree with two kids is:

1. add the complete tree as a candidate
2. add the tree with a wildcard (stack operand) instead of the left tree
3. add the tree with a wildcard (stack operand) instead of the right tree
4. for each constant that appears somewhere in the tree, generate a variant with a wildcard (immediate operand) for the constant
5. repeat this algorithm for the left tree
6. repeat this algorithm for the right tree

The algorithm goes in a similar fashion if the tree has any other number of kids. For practical reasons a maximum of one wildcard for a constant is imposed. In this we follow [FP95]. The sketch below illustrates this recursion.
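The following C fragment is an illustrative reconstruction of steps 1-3 and 5-6 of the recursion (the constant-wildcard variants of step 4 are left out for brevity). It is not the actual candgen source; the Node type and the helper functions are assumptions made for this sketch.

#include <stdio.h>

/* A tiny expression-tree node; the operand value is folded into the
   operator name to keep the sketch short. */
typedef struct Node { const char *op; struct Node *l, *r; } Node;

/* Print a pattern, writing '*' for the node designated as wildcard. */
static void print_pat(const Node *t, const Node *wild) {
    if (t == wild) { printf("*"); return; }
    printf("%s", t->op);
    if (t->l) {
        printf("(");
        print_pat(t->l, wild);
        if (t->r) { printf(","); print_pat(t->r, wild); }
        printf(")");
    }
}

static void add_candidate(const Node *t, const Node *wild) {
    print_pat(t, wild);
    printf("\n");
}

static void gen_patterns(Node *t) {
    if (!t) return;
    add_candidate(t, NULL);            /* 1: the complete tree          */
    if (t->l) add_candidate(t, t->l);  /* 2: wildcard for the left kid  */
    if (t->r) add_candidate(t, t->r);  /* 3: wildcard for the right kid */
    gen_patterns(t->l);                /* 5: recurse into the left kid  */
    gen_patterns(t->r);                /* 6: recurse into the right kid */
}

int main(void) {
    /* the tree for j = i + 1; from Example 2.2 */
    Node addrfp = { "ADDRFP8", 0, 0 };
    Node indir  = { "INDIRI", &addrfp, 0 };
    Node cnst   = { "CNSTI1", 0, 0 };
    Node add    = { "ADDI", &cnst, &indir };
    Node addrlp = { "ADDRLP-4", 0, 0 };
    Node asgn   = { "ASGNI", &addrlp, &add };
    gen_patterns(&asgn);
    return 0;
}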
Example 2.4 This is an example of the tree patterns that the tree pattern algorithm generates. The input tree for j = i + 1; (see Example 2.2) is:

ASsiGNInteger(ADDRessLocalPointer[-4],
              ADDInteger(CoNSTantInteger[1],
                         INDIRectInteger(ADDRessFormalPointer[8])))

It generates the following tree patterns (the numbers before the patterns correspond with the numbers in the algorithm):

1 ASsiGNInteger(ADDRessLocalPointer[-4], ADDInteger(CoNSTantInteger[1], INDIRectInteger(ADDRessFormalPointer[8])))
2 ASsiGNInteger(*, ADDInteger(CoNSTantInteger[1], INDIRectInteger(ADDRessFormalPointer[8])))
3 ASsiGNInteger(ADDRessLocalPointer[-4], *)
4 ASsiGNInteger(ADDRessLocalPointer[#], ADDInteger(CoNSTantInteger[1], INDIRectInteger(ADDRessFormalPointer[8])))
4 ASsiGNInteger(ADDRessLocalPointer[-4], ADDInteger(CoNSTantInteger[#], INDIRectInteger(ADDRessFormalPointer[8])))
4 ASsiGNInteger(ADDRessLocalPointer[-4], ADDInteger(CoNSTantInteger[1], INDIRectInteger(ADDRessFormalPointer[#])))
5 recursive call for ADDRessLocalPointer[-4]
6 recursive call for ADDInteger(CoNSTantInteger[1], INDIRectInteger(ADDRessFormalPointer[8]))

A total of 21 tree patterns is generated for this example.
Note that only one '*' can appear in a pattern, and that this wildcard can only appear as a child of the root node of that pattern, conforming to [FP95]. Modern code generator generators such as lburg (used in the lcc compiler) model the instruction set of the target machine (a concrete or virtual machine) by a set of rules that map IR tree fragments onto target machine instructions. This set of rules is used to generate a code generator that covers an expression tree with tree patterns. For most target machines, there are several possibilities to cover a tree. The code selection process is controlled by a cost function that is attributed to the code rule. The code generator will always find the cover that minimizes the cost function; this is called a least-cost tree cover.
Example 2.5 We illustrate the code selection process based on tree pattern matching with a small example. The following rules are part of a machine description for an imaginary stack machine:

#   pattern                          lburg rule                           cost
1   ASsignInteger(*,*)               stmt: ASGNI(stk,stk)  "asgni"        1
2   ADDInteger(*,*)                  stk:  ADDI(stk,stk)   "addi"         1
3   INDIRectInteger(*)               stk:  INDIRI(stk)     "indiri"       1
4   ADDRessFormalPointer[#]          stk:  ADDRFP          "addrfp %a"    5
5   ADDRessLocalPointer[#]           stk:  ADDRLP          "addrlp %a"    5
6   CoNSTInteger[#]                  stk:  CNSTI           "cnsti %a"     5
7   ADDInteger(CoNSTInteger[#], *)   stk:  ADDI(coni)      "addicon %0"   5
                                     coni: CNSTI           "%a"           0
Pattern number 7 needs two lburg rules; the rest of them need only one rule. Using these rules, two different tree covers are possible for tree (a) from Example 2.2. Both the cost and the name of the matching pattern are put beside the node(s) the pattern matches, see Figure 5.

[Figure 5: Two different tree covers for j = i + 1;. The first cover uses asgni(1), addrlp(5), addi(1), cnsti(5), indiri(1) and addrfp(5), for a total cost of 18; the second replaces addi and cnsti by the single pattern addicon(5), for a total cost of 17.]

It is clear that the second cover costs less than the first, so the second cover is selected. All the generated patterns are emitted as rules for a lburg machine description. We are compiling to minimize size, so the number of bytes that an instruction would need is used as the cost in the corresponding rule. An instruction needs at least one byte for the opcode. If an instruction has an immediate operand, the size of the type of this immediate operand has to be added to the cost.
Example 2.6 Here are a few examples of patterns with their corresponding costs:

pattern                                           cost
ADDInteger(*, CoNSTInteger[1])                    1 + 0 = 1
ADDInteger(*, CoNSTInteger[#])                    1 + 4 = 5
ADDInteger(*, ConVertCharInteger(CoNSTChar[#]))   1 + 1 = 2

Note that an integer immediate operand needs 4 bytes and a char immediate operand needs 1 byte.

Because the execution time of the next phase depends largely on the number of patterns emitted in this stage, some filtering is done to eliminate patterns that are not likely to save any space:
- It is possible at this stage to estimate an upper bound on the amount of space a pattern could save. This upper bound is O * S, where O is the number of occurrences of the pattern and S is the number of bytes an occurrence would save if this pattern were the only pattern being added (the maximal savings per occurrence). The maximal savings per occurrence S is equal to the number of bytes it would take to emit the tree using only base operators, minus the cost of the pattern. The pattern is not emitted if this upper bound is smaller than an estimate of the extra bytes needed to add this instruction to the interpreter. (A small sketch of this filter is given after this list.)

Example 2.7 This table contains some examples of this value S:

pattern                                           S
ADDInteger(*, CoNSTInteger[1])                    (1 + 1 + 4) - 1 = 5
ADDInteger(*, CoNSTInteger[#])                    (1 + 1 + 4) - 5 = 1
ADDInteger(*, ConVertCharInteger(CoNSTChar[#]))   (1 + 1 + 1 + 1) - 2 = 2

- Only patterns which are smaller than some fixed number of nodes (N), typically 5-10, are emitted. We have empirically shown that larger patterns are not very likely to save much space, see also Section 2.4.
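The filtering estimate can be sketched as follows. This is illustrative code, not candgen's real data structures: the Candidate fields, the helper names and the interpreter-overhead value are all assumptions.

#include <stdio.h>

typedef struct {
    int occurrences;  /* O: number of occurrences of the pattern        */
    int base_size;    /* bytes to emit the fragment with base operators */
    int cost;         /* opcode byte plus immediate-operand bytes       */
} Candidate;

/* maximal savings per occurrence */
static int savings_per_occurrence(const Candidate *c) {
    return c->base_size - c->cost;                         /* S */
}

/* keep a pattern only if even the upper bound O*S can pay for the
   estimated extra interpreter code */
static int keep(const Candidate *c, int interp_overhead) {
    return c->occurrences * savings_per_occurrence(c) >= interp_overhead;
}

int main(void) {
    /* ADDInteger(*, CoNSTInteger[1]) from Example 2.7: S = 6 - 1 = 5;
       the occurrence count and overhead are invented numbers */
    Candidate addi1 = { 10, 6, 1 };
    printf("%s\n", keep(&addi1, 16) ? "kept" : "filtered"); /* 10*5 >= 16 */
    return 0;
}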
Even after this filtering a large number of candidates remain. For example, when running candgen on the program li with N=4, about 7000 trees are read and 8000 distinct patterns are generated that are not too big. After further filtering 1400 patterns remain. A total of 1500 rules is needed to add the candidate instructions that correspond to these patterns to the base VM machine description.
Selection of the instruction set

In Section 1.3 we decided to choose a bytecode interpreter as our basic VM. The base instruction set of our VM, specified in Appendix A, consists of 113 opcodes. This leaves a maximum of 144 opcodes that could be used to encode a candidate instruction. Specialized tree patterns with more nodes and more hard-wired constants save bytes but consume a scarce opcode for a special case. More general patterns need more bytes to do the same work, but might be used more often. How much space it would save to assign one of these scarce opcodes to one of the candidate instructions depends in part on what other instructions are part of this instruction set.
Example 2.8 As an example, suppose an instruction (which we might call addindi) with the pattern ADDInteger(*, INDIRectInteger(*)) were to be added to the instructions in Example 2.5. This would mean that, besides the two tree covers that were already possible, the tree cover in Figure 6 would also be possible.

[Figure 6: A third tree cover (c) for j = i + 1;, using asgni(1), addrlp(5), cnsti(5), addindi(2) and addrfp(5).]

This cover is more expensive than cover (b) from Example 2.5, and thus the new instruction would not save any space. However, if the instruction addicon weren't part of the instruction set, adding addindi would save space because cover (c) is cheaper than cover (a).

It is necessary to identify a subset of 144 candidate instructions from the generated set of X candidate instructions. Theoretically, this means testing C(X, 144) instruction sets (the number of ways to choose 144 candidates out of X) to identify the one that is optimal with respect to code compaction.
Obviously this is not feasible; for li this would mean that C(1400, 144), about 10^200, different instruction sets would have to be tested. Instead of testing all possible subsets, a greedy heuristic is used. It starts with the set of base operators and identifies which of the candidate instructions yields the greatest savings in code size. This is done by each time adding one of the candidates to the base set and computing the resulting code size using that instruction set. You could say that it compiles the input program for different target machines. Once the best candidate is known, it joins the instruction set and the process is repeated. It stops when all opcodes are in use or no candidate saves any space. This greedy heuristic needs to try at most X * 144 different instruction sets instead of the original C(X, 144). This still takes a lot of time; therefore:
- To prevent the need to recompile the code generator for each try, lburg was changed in such a way that it generates a code generator that can use only part of its rules. This is done by manipulating cost functions.
- If a candidate doesn't save any space at a certain run, it will not be tried any more in the following runs. This typically happens when competing instructions have joined the instruction set, causing this particular instruction not to be selected by the code generator if it were added to the instruction set at that time. Once an instruction doesn't save any more space, it is very unlikely to do so in the future. The resulting code compaction doesn't depend significantly on whether or not such an instruction is unqualified as a candidate. It is therefore unqualified, which prevents a lot of useless computing.

In [FP95] a small program is used to recreate the IR trees instead of rerunning the entire front end for each try. This seemed not so simple to implement, so we haven't done it. A simplified sketch of the greedy selection loop is given below.
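This sketch is an illustrative reconstruction, not the compiler's actual code: code_size() stands in for a full compile-and-measure run, and the candidate savings and opcode budget are invented numbers.

#include <stdio.h>

#define NCAND 4

/* Toy stand-in for "compile the program with this instruction set and
   measure the code segment". Each selected candidate shrinks a
   fictitious program by an invented number of bytes. */
static int code_size(const int selected[NCAND]) {
    static const int saving[NCAND] = { 40, 25, 25, 5 };
    int size = 1000, i;
    for (i = 0; i < NCAND; i++)
        if (selected[i]) size -= saving[i];
    return size;
}

int main(void) {
    int selected[NCAND] = { 0 }, alive[NCAND];
    int i, free_opcodes = 3;                 /* pretend 3 opcodes are unused */
    for (i = 0; i < NCAND; i++) alive[i] = 1;

    while (free_opcodes-- > 0) {
        int base = code_size(selected), best = -1, best_gain = 0;
        for (i = 0; i < NCAND; i++) {        /* try each remaining candidate */
            int gain;
            if (!alive[i] || selected[i]) continue;
            selected[i] = 1;
            gain = base - code_size(selected);
            selected[i] = 0;
            if (gain <= 0) alive[i] = 0;     /* unqualify: saves nothing now */
            else if (gain > best_gain) { best_gain = gain; best = i; }
        }
        if (best < 0) break;                 /* no candidate saves any space */
        selected[best] = 1;                  /* the best one joins the set   */
        printf("selected candidate %d (%d bytes)\n", best, best_gain);
    }
    printf("final code size: %d\n", code_size(selected));
    return 0;
}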
Generation of the interpretive code

The generation of the interpretive code is done in two stages. In the first stage an assembly-like version is generated. In the second stage an assembler translates this into bytecodes.
The assembly

The assembly is emitted using the same back end as used in the previous step. Only two segments exist: a code segment (sometimes called text segment) and a data segment.

The bytecodes

The assembler emits the interpretive bytecode and some relocation data. As mentioned before, the code segment is byte-aligned and the data segment is aligned as required by the host machine.
Generation of the interpreter

An interpreter has to be built that implements the VM. This VM is defined by the instructions that are selected. The purpose of the research described in this thesis is code compaction, not building an efficient interpreter. The main reason we built the interpreter was to be able to test the interpretive code we generate, and to prove that it can be generated. We kept the amount of time needed for constructing the interpreter to a minimum. No energy or time has been invested in making it fast (or small).

Three common interpretation techniques can be distinguished:

1. Classical interpreter with opcode table
2. Direct threaded code
3. Indirect threaded code

For an overview of these techniques and some benchmark results, see [Kli81]. The classical interpreter is chosen, mainly because it is the most straightforward to implement. It is implemented in C, not in assembly. The interpreter is basically a switch with at most 256 cases of a few statements each. The total size of a generated interpreter like this is typically 4-8 KB.

The interpreter has two internal stacks. One stack is for the evaluation of expressions; it will be called the E-stack. The other stack is for keeping the parameters, local variables, return address and the old frame pointer; this stack will be called the I-stack (where the I stands for Invocation). It would be possible to map the I-stack onto the E-stack, but because the interpreter is implemented in C it is more straightforward to implement them separately. The I-stack is implemented as an array of char; the E-stack is implemented as a union of all possible types that can be on the stack.

Some library routines which are common in many C programs can be called by the source program. For each of these library routines a so-called envelope is built. This means that each of the selected library routines is, manually, built into the interpreter. It is quite easy to add other library routines if needed. This method causes some (time) overhead for both calls to library routines and 'normal' calls. Of course, since library functions are executed directly on the host machine instead of by the interpreter, the execution of a library function goes just as fast as in normally compiled programs. This method is not very flexible, but it allows us to test the benchmark programs.
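Such a classical bytecode interpreter boils down to a fetch-decode loop around a switch statement. The fragment below is a minimal sketch of the idea with an invented two-instruction VM; the opcode values, the E-stack layout and the hand-written code array (which assumes a little-endian host) are assumptions for the example, not the generated interpreter.

#include <stdio.h>
#include <string.h>

/* Invented opcodes for a two-instruction VM plus a stop marker. */
enum { OP_CNSTI = 1, OP_ADDI, OP_HALT };

static int run(const unsigned char *pc) {
    int estack[64], sp = 0;                /* a tiny E-stack */
    for (;;) {
        switch (*pc++) {                   /* fetch and decode one opcode */
        case OP_CNSTI: {                   /* push a 4-byte immediate */
            int imm;
            memcpy(&imm, pc, sizeof imm);  /* immediates are byte-aligned */
            pc += sizeof imm;
            estack[sp++] = imm;
            break;
        }
        case OP_ADDI:                      /* pop two integers, push the sum */
            sp--;
            estack[sp - 1] += estack[sp];
            break;
        case OP_HALT:
            return estack[--sp];
        }
    }
}

int main(void) {
    /* CoNSTInteger[2]; CoNSTInteger[3]; ADDInteger (little-endian bytes) */
    unsigned char code[] = { OP_CNSTI, 2, 0, 0, 0,
                             OP_CNSTI, 3, 0, 0, 0,
                             OP_ADDI, OP_HALT };
    printf("%d\n", run(code));             /* prints 5 */
    return 0;
}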
2.4 Results

To determine the compression, the size of the code segment of the interpretive code is measured. This size is compared with the size of the code segment generated by lcc for a native MIPS target machine. The sizes don't include libraries. To be more precise, the size of the code segment of the normally compiled program is determined like this:
kirk[~] (469) > uname -a
IRIX kirk 5.3 11091811 IP19 mips
kirk[~] (470) > lcc -c -target=mips-irix yacc.c -o yacc.lcc.o
kirk[~] (471) > size yacc.lcc.o
Size of yacc.lcc.o: 7712

Section    Size   Physical Address   Virtual Address
.text      6064   0                  0
.rodata     208   0                  0
.data      1360   0                  0
.sdata       48   0                  0
.options     32   0                  0
The code segment of the interpretive code is then compared with the size of the text segment, 6064 bytes in this case. The programs that are used to test the basic compiler are:

name       # lines   # kilobytes   source origin
yacc           600        13       test set lcc; a parser, not a parser generator
compress      1500        40       SPECint '92
lburg         1600        40       code generator generator used by lcc
eqntott       3500        73       SPECint '92
li            7100       153       SPECint '92 (without xleval.c)
dcg          16900       653       software for a midiset
ecf          17700       751       an embedded program code fragment
espresso     13600       337       SPECint '92

The file xleval.c is excluded from li due to problems with consistent compilation across multiple platforms. The dcg and ecf are programs for real embedded systems. Both programs have been made available by Product Divisions of Philips.

Figure 7 illustrates the difference between some compilers on the MIPS-IRIX with respect to code size. Several compilers with different optimizations have been tried for some of the test programs. With data (segment) all segments except the code segment are meant. cc yields the smallest size with the -O3 optimization option, gcc with -O2. lcc doesn't have any optimization options. No other options were used. Overall one can say that the total size of code generated by lcc is comparable to that of the other two compilers using their optimizations.

Comparisons with conventional, normally compiled executables can be tricky because code segments can include some parts of the literal pool that the compressed code omits, and branch tables can appear in data segments when it is more appropriate to charge them to the code segment. These factors don't have a large effect in the experiments reported below.
[Figure 7: Comparing the sizes of object files produced by several compilers (cc, cc -O3, gcc, gcc -O2, lcc) for yacc, lburg, li and ecf; each bar shows the code and data segment sizes in kilobytes.]

In Figure 8 the compression ratios for different N, with N being the maximum number of nodes, are plotted. Note that N=0 means that the compiler has the base VM as target machine. Because each compilation can take hours, or even days, this comparison between different values of N is only done for the smaller test programs. All compression ratios with N=4 have been computed, see Figure 9. The raw data of these tests can be found in Appendix B. From the results with N=4 we conclude that the basic compiler about halves the code segment.
[Figure 8: Compression ratios of yacc and lburg for different maximum numbers of nodes (N = 0 to 10), with lcc as the reference.]
[Figure 9: Compression ratios with N=4 for all test programs (yacc, compress, lburg, eqntott, li, dcg, ecf, espresso), comparing lcc, N=0 (base VM) and N=4.]
3 Enhancements

In the previous section the basic compiler was described. In this section several enhancements to the compiler are described, together with their effect. We define the effect of an enhancement as a percentage that is determined like this:

    effect = ((E - F) / F) * 100%    (1)

where

    E = the code size of the test programs produced with a version of the compiler with all enhancements built in except the enhancement of which the effect is being measured
    F = the code size of the test programs produced with a version of the compiler with all enhancements built in
This effect can be considered the net effect of the enhancement in question. Note that a positive effect is an improvement of the compression. We will refer to the compiler with all enhancements built in as the final compiler. It would take too long to measure the effect of each enhancement for all 1 <= N <= 10, with N being the maximum number of nodes of an instruction. Therefore one value of N is chosen to serve as a reference point and all tests are run with this value of N. Since the compression ratios don't really improve much after N=4, this value is chosen to serve as the reference point. All raw data for the tests in this section can be found in Appendix B. In each figure the number of the table which contains the raw data is placed between braces in the caption.
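As a worked illustration of Formula (1), with invented numbers: if a test program compiles to E = 10,500 bytes when one enhancement is left out and to F = 10,000 bytes with all enhancements built in, the effect of that enhancement is ((10500 - 10000) / 10000) * 100% = 5%, a positive effect and thus an improvement of the compression.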
The final compiler

Figure 10 contains the compression ratios obtained with the final version of the compiler. The compression ratios are measured in exactly the same way as they were in Chapter 2. Note that the code sizes of the compiler for the final base VM are also smaller than they were with the basic compiler for the original base VM.

Enhancements

The enhancements are divided into three subsections:

1. Doing it faster. Two changes to the basic compiler are discussed that shorten the execution time of the compiler.
2. Doing it better. Three changes are discussed which make the algorithm produce smaller code sizes.
3. General space-saving enhancements. Two general space-saving optimizations are discussed. They make the compiler produce smaller code sizes, but don't really have anything to do with the technique the compiler uses.
[Figure 10: Compression ratios of the final compiler (N=4) for all test programs, comparing lcc, N=0 (stack base VM) and N=4 (Table 9).]
3.1 Doing it faster

The fact that it takes the compiler quite a long time to compile one program is not acceptable for a production compiler. For a research compiler it is mainly annoying. For example, it takes the basic compiler more than a week to compile ecf on a Pentium 120 running Linux. This is too long, even for a research compiler. So two changes were made to the compiler which make it run considerably faster, without a significant change in the code size. After these changes it still takes the compiler a long time to compile, but it is fast enough to be able to run the test programs. The run time when compiling ecf with the final compiler is reduced to about one day on the Pentium 120.
Unqualifying candidates

Unqualifying a candidate if it would save less than some predetermined number of bytes M (M >= 1) decreases the run time of the compiler considerably. In fact, the basic compiler already ran with M=1. This can however have some effect on the resulting code size: if a pattern is unqualified that might have been selected later on, the resulting compression is affected. Unfortunately, the amount of space saved by a candidate doesn't decrease monotonically. Or, to put it in other words, it can happen that a candidate becomes more profitable after the selection of some other candidates.
Example 3.1 As an example of a candidate that becomes more profitable, consider the following three candidates:

name       pattern
superop1   ADDRessFormalPointer[-8]
superop2   INDIRectInteger(ADDRessFormalPointer[-8])
superop3   ADDInteger(CoNSTInteger[1], INDIRectInteger(*))
Now look at tree I in Figure 11. This tree is matched with only base operators.

[Figure 11: Example of a candidate becoming more profitable: tree I for j = i + 1; is covered with base operators only; in tree II superop2 covers INDIRectInteger(ADDRessFormalPointer[8]); in tree III superop3 covers the ADDInteger subtree and superop1 covers ADDRessFormalPointer[8].]

Initially the potential savings of candidate superop1 are four. Consider what happens if first superop2 (tree II) and then superop3 (tree III) are selected. The possible savings are first reduced to zero, but after superop3 is selected the possible savings increase again. Of course, if the tree in Figure 11 were the only tree, superop2 would never be selected.
Because the possible savings of a candidate don't decrease monotonically, the heuristic can't just unqualify a candidate which isn't doing so well; it might do better in the future. Fortunately, this is rare. It is so rare that there is no significant effect on the resulting code size if the heuristic is permitted to unqualify candidates which save too little space at a certain time. The only question is how to determine what is too little, i.e. what value of M doesn't throw away candidates that would otherwise have been selected, and at the same time throws away as many candidates as possible. This value depends of course on the program which is being compiled. The larger the program, the larger M can be without affecting the resulting code size. The upper limit for M is the number of bytes that the last instruction will save, minus one. This value can be determined by running the compiler once with a conservative estimate of M (e.g. 1 or even 0). Subtracting the maximum of five and 30% of the upper limit of M from the upper limit usually gives a good value for M. Figure 12 shows the effect of different M's for the test programs yacc and lburg. The value of M normally used for yacc is 1, for lburg it is 4.
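As a hypothetical worked example of this rule of thumb: if the last instruction selected in such a trial run saves 12 bytes, the upper limit for M is 11, and M = 11 - max(5, 0.3 * 11) = 11 - 5 = 6.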
[Figure 12: Effect of unqualifying for different values of M: yacc (M = 0 to 5, effects between +1% and -9%) and lburg (M = 1 to 11, effects between +0.5% and -3%) (Table 2).]
Faster heuristic

The run time of the compiler depends linearly on the number of times an instruction set is tried. As already explained in Section 2.3, the possible savings of a candidate depend on what other candidates have been selected. This means that the amount of savings of a candidate has to be recomputed every time a new candidate is selected, because the amount of savings might have been influenced by the selection of the new candidate. Although this is certainly true, the change in potential savings is usually not very big. The optimization described here exploits this: the selection of new instructions takes less time if the heuristic only has to try all candidates once in a while instead of every time it selects a new instruction. The changed heuristic (called circle) works as follows:

1. Try all candidates, and sort them in order of decreasing profit.
2. Take the best X candidates; this subset is called S.
3. Select the best candidate.
4. Try out all candidates which are part of set S, and select the best one.
5. If the number of candidates in set S hasn't dropped below some minimum Y, repeat the previous step.
6. Repeat steps 1-5 until all candidates have been selected.

The assumption is that, if X and Y are chosen wisely, the instructions that would have been selected by the original heuristic will almost always be in set S. When X is equal to 10% of the initial number of candidates, and Y = X/2, the resulting code size is not significantly different from the code size of the original heuristic. These values of X and Y are used for all tests. Figure 13 shows the effect of using the circle heuristic for a few of the test programs. The effect is almost zero. A simplified sketch of the circle heuristic is given after the figure.
[Figure 13: Effect of the circle heuristic for yacc, compress and lburg; all effects lie between -0.25% and +0.20% (Table 3).]
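The sketch below is an illustrative, simplified rendering of the circle heuristic: the stand-in trial function, its invented savings, and the way S shrinks are all assumptions made for the example, not the compiler's actual code.

#include <stdio.h>
#include <stdlib.h>

#define NCAND 20

typedef struct { int id, savings; } Cand;

/* Toy stand-in for one full compile-and-measure run per candidate;
   the returned savings are invented. */
static int trial_savings(int id) {
    return (id * 7919) % 23 + 1;
}

static int by_savings_desc(const void *a, const void *b) {
    return ((const Cand *)b)->savings - ((const Cand *)a)->savings;
}

int main(void) {
    int done[NCAND] = { 0 }, total = 0;

    while (total < NCAND) {
        Cand s[NCAND];
        int n = 0, i, X, Y, k;

        /* step 1: one expensive round over all remaining candidates */
        for (i = 0; i < NCAND; i++)
            if (!done[i]) { s[n].id = i; s[n].savings = trial_savings(i); n++; }
        if (n == 0) break;
        qsort(s, n, sizeof *s, by_savings_desc);
        if (s[0].savings <= 0) break;       /* nothing saves space any more */

        /* step 2: keep only the best X candidates as the subset S */
        X = n / 10 > 0 ? n / 10 : 1;
        Y = X / 2;

        /* steps 3-5: select from S only, until S shrinks below Y
           (re-trying the members of S is elided; the sorted order
           stands in for it) */
        for (k = 0; k < X && X - k > Y; k++) {
            done[s[k].id] = 1;
            total++;
        }
    }
    printf("selected %d candidates\n", total);
    return 0;
}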
3.2 Doing it better

Three ways to improve the code compaction have been implemented:

1. Adjusting the base set of operators
2. Generating more candidates
3. Trimming the set of selected candidates
Adjusting the base set of operators

The base set of operators, as it was used in the basic compiler, was (almost) equal to the set of IR operators of lcc. This set was not developed for the purpose of being a compact base set of operators for a stack-based VM. Several changes have been made to the base set.
Omitting base operators
The compiler benefits from a smaller base set of instructions in two ways:

1. The trees will be more alike. The fewer different operators, the fewer different trees. The selected superoperators will be able to match more expression trees.
2. More opcodes are available for superoperators.
Instructions that can be omitted can be divided into three categories:

1. Instructions that have an empty implementation in the interpreter. This category contains some of the conversion operators, e.g. the operator to convert a (signed) integer to an unsigned integer.
2. Instructions that have an identical implementation in the interpreter. The VM has four types which have a size of four bytes: (signed) integer, unsigned integer, pointer and float. In most cases the implementation of an operator for the different types is equal, e.g. in the case of the assignment operators. Another example is ADDRessLocalPointer and ADDRessFormalPointer; both are an offset to the frame pointer.
3. Instructions that can be implemented by other instructions. These are some of the conditional jump instructions. E.g. a jump if greater than can be implemented with a jump if less than by swapping the two kids.

Operators which can be implemented by more than one other operator have also been tried, but that approach required a lot more changes in lcc. This table lists all operators that have been omitted:
omitted operator          implemented by            category
ADDRessFormalPointer      ADDRessLocalPointer       II
CoNSTPointer              CoNSTInteger              II
CoNSTUnsigned             CoNSTInteger              II
ConVertIntegerUnsigned    -                         I
ConVertPointerUnsigned    -                         I
ConVertUnsignedInteger    -                         I
ConVertUnsignedPointer    -                         I
INDIRectPointer           INDIRectInteger           II
INDIRectFloat             INDIRectInteger           II
ADDInteger                ADDUnsigned               II
ADDPointer                ADDUnsigned               II
LeftShiftInteger          LeftShiftUnsigned         II
SUBtractInteger           SUBtractUnsigned          II
SUBtractPointer           SUBtractUnsigned          II
ASsignPointer             ASsignInteger             II
ASsignFloat               ASsignInteger             II
GreaterEqualInteger       LessEqualInteger          III
GreaterEqualUnsigned      LessEqualUnsigned         III
GreaterEqualFloat         LessEqualFloat            III
GreaterEqualDouble        LessEqualDouble           III
GreaterThanInteger        LessThanInteger           III
GreaterThanUnsigned       LessThanUnsigned          III
GreaterThanFloat          LessThanFloat             III
GreaterThanDouble         LessThanDouble            III
ARGumentPointer           ARGumentInteger           II
ARGumentFloat             ARGumentInteger           II
RETurnFloat               RETurnInteger             II
RETurnDouble              RETurnInteger             II
Constants
Just like in the other back ends of lcc, no CoNSTantFloat and CoNSTantDouble operators are used. Instead, when a constant of type float or double is needed, it is put in the data segment as a literal. This is done to prevent difficulties when compiling on a machine that treats floats differently than the machine the compiled program will run on. As a consequence the base set of operators contains two operators fewer.

One minor change was made to the way (the front end of) lcc deals with common subexpressions. lcc creates for all common subexpressions in a tree a temporary to which the common subexpression is assigned. The only exceptions that are made are for the ADDRessLocalPointer and ADDRessFormalPointer operators. This means that lcc also creates a temporary for all common subexpressions which are just constants. Assigning a common subexpression to a temporary when that common subexpression is just a constant takes in almost all cases more bytes than just repeating the constant operator. Therefore the front end of lcc is modified so that it won't create temporaries for common subexpressions which are just constants.
Adding new types

A lot of the immediate operands of the operators CoNSTInteger[#] and ADDRessLocalPointer[#] are quite small. Most of them can be put in one byte; some have to be put in two bytes. Only in a minority of the cases are all four bytes really necessary. Therefore two new types are introduced:

- IntegerEight, an integer represented in 8 bits. This means that all integers i with -128 <= i < 128 can be represented with one byte instead of four.
- IntegerSixteen, an integer represented in 16 bits. This means that all integers i with -32768 <= i < 32768 can be represented with two bytes instead of four.
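The size gain amounts to a simple range check when deciding how to encode a constant. The fragment below is illustrative: the opcode-plus-immediate cost model follows the report, but the helper itself is invented, not part of the actual compiler.

#include <stdio.h>

/* Bytes needed to encode a constant, assuming one opcode byte plus the
   smallest immediate that can hold the value (an invented helper). */
static int const_cost(long v) {
    if (v >= -128 && v < 128)
        return 1 + 1;      /* IntegerEight: opcode + 1-byte immediate   */
    if (v >= -32768 && v < 32768)
        return 1 + 2;      /* IntegerSixteen: opcode + 2-byte immediate */
    return 1 + 4;          /* full integer: opcode + 4-byte immediate   */
}

int main(void) {
    printf("%d %d %d\n", const_cost(1), const_cost(1000), const_cost(100000));
    /* prints: 2 3 5 */
    return 0;
}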