A Uniform Internal Representation for High-Level and Instruction-Level Transformations

Eduard Ayguadé, Cristina Barrado, Jesús Labarta, David López, Susana Moreno, David Padua, and Mateo Valero
Departament d’Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona
1 Introduction

Compiler techniques for the automatic detection of parallelism are often described in the literature and implemented as source-to-source transformations [Wolf82, Zima91, BENP93]. These implementations are usually experimental translators for the parallelization and vectorization of loops [ABCC88, PGHL89, BEKG94]. Low-level operations, such as those needed to compute the addresses of arrays and scalar parameters, are usually hidden from these source-to-source translators, which only represent high-level constructs internally.

Most compilers for high-performance machines include two separate translators: a source-to-source translator for the automatic detection of parallelism and a back-end compiler that generates machine instructions. This division of labor usually implies duplication of effort and makes it difficult for these translators to interact. Some analysis and transformation algorithms may be needed by both translators. One reason is that the internal representation is normally different in each translator, and therefore much of the code for the analysis and transformation algorithms has to be different. Also, the two translators are usually written by different groups of people, possibly at different times, and little effort is made to share the analysis and transformation routines. Data dependence analysis is an example of an algorithm that may be needed by both translators. The source-to-source translator needs dependence analysis to detect loop-level parallelism, and the back-end needs it to detect instruction-level parallelism. Another example is induction variable recognition and removal, which is useful in the source-to-source parallelizer to increase parallelism and facilitate dependence analysis. In the back-end compiler, this transformation has to be applied again to identify and remove the redundant induction variables produced by strength reduction.

The need for interaction arises when one translator needs information that is naturally available to the other. We discuss next three classes of interaction. In the first class, the information computed by the source-to-source translator is needed by the back-end compiler. All that is needed in this case is to pass the information together with the transformed program. The interaction becomes more difficult, but still feasible, in the second class of interaction, where information from the back-end compiler is needed by the source-to-source parallelizer. For example, the exact machine code translation of the loop body is useful to determine the execution time of the loop body, which the parallelizer has to estimate when deciding the degree of blocking or unrolling needed to minimize the
effect of the scheduling overhead. One simple way to obtain this information, albeit not very efficient, is to invoke the back-end compiler to determine the exact form of the object code and then feed this information back to the parallelizer. The division of labor becomes an insurmountable obstacle in the third class of interaction, where two transformations which are performed separately by each translator need to be consolidated into a single transformation. For example, when translating a loop into parallel form, it seems better to apply a single transformation that takes into account both loop-level and instruction-level parallelism.

In this paper we describe a strategy that makes it possible, after applying a small number of changes, to represent low-level operations as part of the internal representation of a conventional source-to-source Fortran translator. Briefly, our strategy is to represent the low-level operations as Fortran statements. In this way, all the transformation and analysis routines available in the source-to-source restructurer can be applied to the low-level representation of the program. The source-to-source parallelizer could then be extended to include many traditional analysis and transformation steps, such as strength reduction and register allocation, not usually performed by this translator. The generation of machine instructions is done as a last step by a direct mapping from each Fortran statement onto one or more machine instructions. The source-to-source restructurer is therefore extended into a complete compiler, as shown in Figure 1. All transformations, including high-level parallelization and the traditional scalar optimizations, can now be performed in a unified framework based on a single internal representation. One additional advantage of representing the low-level operations as Fortran statements is that the outcome of each transformation, both high and low level, can be trivially transformed into a Fortran program that could be executed to test the correctness of the transformation.

Another approach that also uses a uniform representation for both high-level parallelization and scalar optimizations was followed in the IBM Fortran compiler [ScKo86]. The main difference from our approach is that this compiler evolved from a traditional back-end compiler which was extended to do some of the high-level transformations usually performed in other systems by a source-to-source translator. The Stanford University Intermediate Form (SUIF) [TWLP91] also has a goal similar to that of the representation described here. Again, the difference is that SUIF extends low-level operations with annotations, and many of the transformations have been implemented by extending low-level passes so that they can recognize the high-level annotations. In our case the objective is to modify, in the simplest possible way, a source-to-source translator so that it can also deal with low-level operations.

The rest of this report describes the proposed internal representation. We start, in Section 2, with an overview of the constructs used. In Section 3, we discuss the representation of assignment statements, including the operations needed for address computation. Section 4 presents the representation of subroutine and function calls. Control structures such as do loops and if statements are discussed in Section 5.
In Section 6 we discuss some code generation strategies, and in Section 7 we present a number of research topics which could profit from the incorporation of our internal representation in a traditional source-to-source restructurer.
Fig. 1. Block diagram of the source-to-source restructurer: Source Program → Transform to low-level form → High-level parallelization and scalar optimizations → Machine code generation (direct mapping) → Object code or assembly language program.
2 Overview of the Proposed Internal Representation

The low-level representation of a Fortran program contains only four control structures: do and while loops, if-then-else statements, and regular gotos. The other control structures are transformed into these using relatively simple strategies. For example, arithmetic ifs are transformed into if-then-else statements, and assigned gotos and computed gotos are transformed into a sequence of if statements. Also, it is possible to transform the if and do structures into more elementary operations. For example, a do loop may have to be transformed into a sequence of statements that initialize, increment, and test the value of the index variable. However, we believe that in most cases efficient machine code can be generated just by expanding the high-level control structures into a pre-defined sequence of instructions. This is described in more detail in Section 5.

Assignment statements are represented as a sequence of elementary assignment statements, including those needed to compute the addresses of array elements. The initial sequence of elementary assignment statements uses virtual registers. There are four classes of virtual registers, one for each basic Fortran type (integer, real, double precision, and logical), and there is an unlimited number of virtual registers in each class. These registers are represented as Fortran scalar variables, and each elementary statement
represents a load, a store, or a register-to-register operation. Register allocation would replace virtual registers with physical registers and would insert spilling code.

References to scalar variables or array elements are represented as pairs. The first component is the reference as it appears in the source program and the second is a base-plus-displacement representation. For example, consider the statement

      a(i,j) = b(i) + 1.0

There are several ways to translate this statement depending on the characteristics of the target machine and the storage allocation of the variables. These will be discussed in detail in the next section. For now, let us consider the following translation:

      DATA c1 /1.0/
      RIS1 = hi(loc(b))
      RIS2 = RIS1 + (lo(loc(b)) - 4)        RIS2 = virtual origin of b
      RIS3 = [i,mem(loc(i),RIS0)]
      RIS4 = RIS3 * 4                        RIS4 = displacement to element (i)
      RFS1 = [b(i),mem(RIS2,RIS4)]
      RIS5 = hi(loc(c1))                     c1 is the real constant 1.0
      RIS6 = RIS5 + lo(loc(c1))
      RFS2 = [c1,mem(RIS6,RIS0)]
      RFS3 = RFS1 + RFS2                     b(i) + 1.0
      RIS7 = hi(loc(a))
      RIS8 = RIS7 + (lo(loc(a)) - 44)        RIS8 = virtual origin of a
      RIS9 = [i,mem(loc(i),RIS0)]
      RIS10 = RIS9 * 4
      RIS11 = [j,mem(loc(j),RIS0)]
      RIS12 = RIS11 * 40
      RIS13 = RIS10 + RIS12                  RIS13 = displacement to element (i,j)
      [a(i,j),mem(RIS8,RIS13)] = RFS3

Here the RISn variables represent fixed point virtual registers and the RFSn floating point virtual registers. The value of the register RIS0 is always zero. The functions hi(loc(x)) and lo(loc(x)) are, respectively, the high and low halves of x’s memory address. The allocation of a variable or array may be static or automatic (on the stack) depending on a compiler switch. mem(RISn,RISm) represents the contents of the memory location with address RISn+RISm, and mem(loc(var),RISm) that of address(var)+RISm. In pairs of the form [x,mem(...)], both components represent the same memory address. The first component is useful to compute data dependences and for symbolic analysis of the program. The mem(...) expression can be used to generate machine code and to identify what addressing values are used.
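The constants -4 and -44 above are the virtual origins of b and a (virtual origins are defined in Section 3.3.1). As a worked example of the arithmetic involved, assume b is declared with lower bound 1 and 4-byte elements, and a is stored by columns with lower bounds 1 and a leading dimension of 10; these dimensions are not stated in the text and are simply the ones consistent with the constants used in the translation:

      address of b(i)   = loc(b) + 4*(i-1)            = loc(b) - 4  + 4*i
      address of a(i,j) = loc(a) + 4*(i-1) + 40*(j-1) = loc(a) - 44 + 4*i + 40*j

The terms -4 and -44 can be folded into the address at compile time, which is exactly what the statements computing RIS2 and RIS8 do.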
3 Assignment statements

In this section we describe the representation of assignment statements. One of our objectives is to show that we can represent several RISC instruction sets in our internal representation.
Assignment statements can take one of two forms:

1. The right-hand side is a unary or binary expression. The left-hand side is always a register, and the source operand(s) are either registers or immediate values.

2. A memory access, where one operand is in a register and the other represents a memory location (a program variable or constant) or, in some cases, an immediate value. The memory location is represented as the pair [variable, memory address].

All Fortran assignment statements can be translated into sequences of assignments in one of these two forms, except for those containing function calls. The representation of function calls is discussed below.
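For instance, a statement of each form looks as follows (a minimal illustration using the register and pair notation introduced in Section 2; the variable x is hypothetical):

      RIS3 = RIS1 + RIS2                   form 1: register-to-register operation
      RFS1 = [x,mem(loc(x),RIS0)]          form 2: memory load of variable x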
3.1 Registers

Initially, the Fortran statements are translated to low-level form using an infinite number of virtual registers, which is the approach used by many compilers. When the program has to be mapped to a particular architecture, the virtual registers are mapped to physical registers, and spill code is inserted when the number of physical registers is not sufficient. Virtual and physical registers can be declared as loop-private variables to avoid the generation of spurious dependences. The names of the virtual registers in our representation begin with “R”, followed by a sequence of letters representing their type and a number to distinguish between the different registers of the same type:
      R T S nn
      where  R   is the register prefix,
             T   is the type: Integer, Float, or Logical,
             S   is the size: Single, Double, Quadruple, or Not applicable,
             nn  is the number that distinguishes registers of the same type.
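For example, RIS1 is integer single-word register number 1, RFS2 is floating-point single-precision register number 2, and RLN1 is logical register number 1 (size not applicable); following the same scheme, a double-precision floating-point register would be written RFDnn, although most of the examples in this report use single-size registers.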
Memory addresses are integer values, and therefore an address can be stored in an integer register. We do not represent complex numbers directly. Complex number operations are represented as a sequence of floating point operations on the real and imaginary parts. Also, characters are manipulated through pre-defined function calls. The names of the physical registers are machine-dependent.

Most machines have special instructions that implicitly work with a value, such as the test instruction that compares with 0. To represent this, the value of every register RTS0 (register number 0 of any type and size) is assumed to be zero when T is either I or F, and .FALSE. when T is L. Also, we have a special register, FP, for accessing data in the stack. See Section 3.3 for details.
3.2 Elementary Operations

In an assignment statement whose right-hand side is an operation, the source and destination operands are all of the same type, except for some comparison operators, where the destination has to be of type logical. When needed, casting is inserted explicitly using subroutine calls. Consider the following statement, where a is a real variable and i an integer variable:

      a = i

One possible translation is

      RIS1 = loc(i)
      call reali_ic (RFS1,[i,RIS1])
      [a,mem(loc(a),RIS0)] = RFS1

The subroutine reali_ic transforms its second, integer, parameter into floating point form and returns the result in the first parameter.

We now discuss the unary and binary operations. We group them as arithmetic and logical. The arithmetic operators are “+”, “-”, “*”, “/”, and “mod”. They correspond to machine instructions found on most machines. The operator “-” can be unary or binary. Internally, these operations are classified as real or integer, depending on the type of the operands. For integer operations, one of the source operands may be an immediate value. In our present implementation, the maximum size in bits of the immediate operand is the value of the external switch immediate_size. The type of the immediate operand is determined by the external switch immediate_type. When immediate_type is 0, the immediate operand is an unsigned integer; otherwise it is a signed integer.

If the target architecture contains more advanced instructions, such as add&multiply or an autoincrement addressing mode, two consecutive statements can be coupled, indicating that both instructions can be mapped into a single machine instruction. The coupling is done using a new type of link called a coupling link. Coupled statements cannot be separated by the translator (for example, invariant removal cannot remove only one of two coupled statements).

Any other Fortran operator is translated into a sequence of statements or into a subroutine call. For example, the power operator “**” is translated into a POWt_IC subroutine call (t indicates the type of the result). Many architectures have some of these more complex instructions microcoded. Also, the run-time library has some of these functions implemented in a very efficient way.

The logical operators are “.EQ.”, “.NE.”, “.GT.”, “.LT.”, “.GE.”, “.LE.”, “.AND.”, “.OR.”, “.NOT.”, “.EQV.” and “.NEQV.”. The left-hand side of assignment statements containing logical operators is always a logical register. Also, the operand(s) are boolean for “.AND.”, “.OR.”, “.NOT.”. The operands for the other operators may be of any type as long as they are the same. For integer source operands, small immediate values are
allowed. If the boolean constants “.FALSE.” and “.TRUE.” appear in a boolean expression, they can be eliminated using peephole optimizations.
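For instance, on a hypothetical target with a fused multiply-add instruction, the two statements below could be coupled so that code generation emits a single instruction (this is only a sketch; the concrete syntax of the coupling link is internal to the representation and is shown here as a trailing annotation):

      RFS3 = RFS1 * RFS2        coupled with the next statement
      RFS4 = RFS3 + RFS5        the pair maps to one multiply-add instruction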
3.3 Memory accesses

An assignment statement can be a memory load or store. The destination, in the case of a load, and the source, in the case of a store, are always either registers or an immediate operand, as discussed in Section 3.2. The other operand is always a memory address of the form [variable, memory address]. The memory address has two components: a base and a displacement, as discussed next.

3.3.1 Memory access in three RISC machines

The three target machines we have studied (MIPS, HP-PA and SPARC) use the base-plus-displacement addressing mode. MIPS uses a small constant as displacement, and the base address is stored in a register. HP-PA and SPARC are more flexible and allow constant and variable displacements:

      MIPS:    small_const(reg)
      HP-PA:   small_const(reg)    or   reg(reg)
      SPARC:   [reg+small_const]   or   [reg+reg]
Depending on the type of the operands, code generation is done as follows:

1) Constants. We may have different types of constants: logical, integer, real and character. Logical and small integer constants fit in the instruction itself, so no extra accesses to memory are needed. Larger integer constants are loaded using two consecutive instructions, as real constants are. The largest constant that fits in one instruction is machine dependent and is specified with the immediate_size flag. Real and character constants are stored statically in memory, and a load operation must be done before they can be used, with the only exception of 0.0, which is represented by the register RFS0. Their addresses are constant and known at compile time. The Fortran compilers of the HP and MIPS generate the same operations for loading a static constant into a register:

      reg = hi(address)
      reg = lo(address) + reg
      float_reg = mem(reg,0)

Because instructions and addresses are both one word wide, two instructions are needed to load a register with an absolute memory address. The last instruction loads the value of the constant. SPARC generates a similar sequence, but it fuses the last two instructions by using the base-plus-displacement addressing mode:

      float_reg = mem(reg+lo(address))

A coupling link may be used in our scheme to represent this addressing mode.
2) Common variables. The variables declared in a common block are statically stored in the data segment. The access to all variables in the same common block is done through the base address of the common block plus the constant displacement to the variable. Both values are known at link time. Therefore, common variables are accessed in the same way as real constants: first, a register is loaded with the address (two instructions), and then the value is accessed through that register.

3) Local variables. The allocation of space for a local variable is sometimes done statically by the compiler. In this scheme recursion is not allowed. Some compilers have a switch to select between static and dynamic allocation. When the allocation of space is static, the access to local variables is done in the same way as the access to real constants and common variables. Static storage can be selected by the programmer with the SAVE statement. If the local variables are allocated in the stack, then only their displacement relative to the frame pointer is known. The access can be done in only one instruction using the base-plus-displacement addressing mode, where the base is the frame pointer register. Again, only one instruction is needed to load the address of a dynamic local variable.

4) Formal parameters. In Fortran, parameters are always passed by reference. The addresses of the parameters can be passed in static memory, in the stack or in registers. The Fortran compilers of the three machines mentioned above generate code so that the addresses of arguments are passed in a set of registers. When the number of parameters is greater than a predefined number of registers, the rest of the parameters are passed through the stack. If the called subroutine calls another subroutine or function, the registers used for parameter passing have to be saved in the stack, and any subsequent access is done through the stack. For the innermost functions, the access to the first predefined set of parameters is done directly through the registers that store their addresses. Otherwise, the access to a parameter is done by first loading its address from the stack (one instruction) and then accessing the parameter itself. The returned value of a function is always stored in a register.

5) Array elements. Array elements are accessed using the base-plus-displacement addressing mode. Suppose the access is to the element A(i,j), where A is an NxM array with limits (r1...r1+N-1, c1...c1+M-1), stored by columns, and i and j are unknown at compile time. The offset from the origin address of A to element (i,j) is equal to (N*(j - c1) + (i - r1)) * size(elements of A). If the expression is simplified to a normal additive form, it is transformed into a linear function of n variables plus some independent terms. The value of the independent terms can be computed at compile time. This value is called the virtual origin (VO) of the array. In the example the VO is (-N*c1 - r1) * size(elements of A). When accessing an array element the compiler tries to compute statically as much of the address as possible. We can basically divide the address into three parts: the variable terms, the base address and the VO.

• The variable terms can be computed only during execution. The coefficients of these terms are integers known at compile time. Most compilers transform the multiplication into a sequence of shift and add instructions, which is faster than
a multiplication. Also, when computing the displacement, some compilers distribute the size factor over the multiplication, while others do not multiply until all the terms have been added. When the array is referenced sequentially within a loop, most compilers use strength reduction to transform the multiplications into a sequence of additions.

• Virtual origin. The virtual origin VO can be computed at compile time, since all the information needed about the size of the array dimensions is known. It is always part of the displacement of the base-plus-displacement memory address.

• The base address can be known or unknown at compile time. If the array is part of a common block, or if it is a SAVEd local variable stored in the data segment, then it is known. In this case the VO and the base address are statically added to form the VO-base. This is one word in size, so it cannot be used as the displacement part of the address. The VO-base is loaded into a register with two instructions: first loading the high part of the address, and then the low part plus the VO. When the array is a formal parameter or a local variable in the stack, the base address is not an absolute value. For matrices allocated in the stack, the VO and the displacement from the frame pointer are statically added to form the VO-displacement. When the matrix is a formal parameter and its address is already in a register, the scheme used may differ: HP adds the displacement and the VO dynamically and stores the result in the register that forms the displacement; SPARC computes the displacement dynamically in a register and then adds the VO to the register with the base address; MIPS adds the displacement and the base address to form the base, and the VO is the constant displacement.

3.3.2 Decisions

In our internal representation the memory addresses are a pair of the form [x,mem(...)]. The high-level notation x is used to perform the data dependence analysis. The low-level notation is the memory address expressed in base-plus-displacement addressing mode. The base will always be stored in a register, but the displacement may be a register or a constant, depending on the flag reg_reg (see below). We define a set of switches to inform the compiler about the machine-specific aspects and conventions of the target machine:

      reg_parameters     #      the first # parameters are passed in registers; the rest
                                are passed according to the param_storage flag
      param_storage      0/1    parameter storage: 0 = automatic, 1 = static
      loc_storage        0/1    local storage: 0 = automatic, 1 = static
      displacement_size  #      number of bits of the constant displacement
      reg_reg            0/1    register+register addressing mode exists: 0 = no, 1 = yes
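For example, a MIPS-like target of the kind discussed in Section 3.3.1 could be described with settings along the following lines (the values are purely illustrative and are not taken from any particular compiler):

      reg_parameters     4      first four argument addresses passed in registers
      param_storage      0      remaining parameters passed on the stack
      loc_storage        0      local variables allocated on the stack
      displacement_size  16     16-bit constant displacement field
      reg_reg            0      no register+register addressing mode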
We generate five different schemes for accessing a variable. Table 1 shows the statements generated for the load of a scalar variable v and for the load of an element i of an array v in each of the five cases. The general case of an access to an array element whose subscript is a constant is treated as a scalar access. The access to real constants is also treated as a scalar access, because a new variable that holds the constant's value has been declared in a DATA statement.
TABLE 1. Code generation for a load of v into RFS1.

(1) STATIC (loc_storage = 1)
    Access to v:
        RIS1 = hi(loc(v))
        RIS2 = RIS1 + lo(loc(v))
        RFS1 = [v,mem(RIS2,RIS0)]
    Access to v(i):
        RIS1 = hi(loc(v))
        RIS2 = RIS1 + (lo(loc(v)) + VO)
        RIS3 = ...offset to elem i...
        RFS1 = [v(i),mem(RIS3,RIS2)]

(2) AUTOMATIC (loc_storage = 0)
    Access to v:
        RFS1 = [v,mem(displ(v),FP)]
    Access to v(i):
        RIS1 = ...offset to elem i...
        RIS2 = RIS1 + (displ(v) + VO)
        RFS1 = [v(i),mem(RIS2,FP)]

(3) PARAMETERS
(3.1) on register (within the first reg_parameters arguments)
    Access to v:
        RFS1 = [v,mem(loc(v),RIS0)]
    Access to v(i):
        RIS1 = ...offset to elem i...
        RIS2 = RIS1 + VO
        RFS1 = [v(i),mem(loc(v),RIS2)]

(3.2) static (param_storage = 1)
    Access to v:
        RIS1 = hi(loc(p_subrv))
        RIS2 = RIS1 + lo(loc(p_subrv))
        RIS3 = mem(RIS2,RIS0)
        RFS1 = [v,mem(RIS3,RIS0)]
    Access to v(i):
        RIS1 = hi(loc(p_subrv))
        RIS2 = RIS1 + lo(loc(p_subrv))
        RIS3 = mem(RIS2,RIS0)
        RIS4 = ...offset to elem i...
        RIS5 = RIS4 + VO
        RFS1 = [v(i),mem(RIS3,RIS5)]

(3.3) automatic (param_storage = 0)
    Access to v:
        RIS1 = mem(displ(v),FP)
        RFS1 = [v,mem(RIS1,RIS0)]
    Access to v(i):
        RIS1 = mem(displ(v),FP)
        RIS2 = ...offset to elem i...
        RIS3 = RIS2 + VO
        RFS1 = [v(i),mem(RIS3,RIS1)]
Case (1): For static storage (commons, data, characters, real constants) we introduce two new functions, hi and lo, which return the high and low parts of a memory address. The loading of the address is split into two instructions because that is how it is also done in assembler. When optimizations and scheduling are applied, the two instructions may be affected differently and better schedules may be obtained. For accessing arrays, RIS0 is replaced by the register that holds the offset to the element. The expression (lo(loc(v)) + VO) can be computed at compile time.

Case (2): When the variable is in the stack, it is accessed in assembler through the frame pointer (FP). If it is a scalar, there is no need to load the address of the variable into any register, since we can access it through the frame pointer and a displacement. We need a new function, displ, which returns the displacement from the frame pointer to the variable. If the variable is an array, then the offset to the element is held in a register, and displ(v) returns the displacement from the frame pointer to the origin of the array v (this is a constant known at compile time). Thus the expression displ(v) + VO can be computed at compile time.

Cases (3): When v is a parameter, the subroutine receives its address. When v is an array, the VO must be added to the displacement during execution. In case (3.1) the address of v is in a register and the access can be done directly using this register. Before register allocation we do not know which register holds the address, so we use the generic function loc(v) instead of the register; after allocation it should be replaced by the register. In cases (3.2) and (3.3) the address of the variable v is in memory. An access to it needs two accesses to memory: the first to load the address and the second to access the value. The loading of the address has no Fortran identifier associated with it because we know that this location is read-only and thus there is no data dependence. In case (3.2) we generate a common block for each subroutine; the variables inside the common block hold the addresses of the parameters (in Table 1, p_subrv is the component of the common block for subroutine subr that holds the address of v). The variables inside the common block need the two extra instructions, hi and lo, to load them. In case (3.3) the address of the parameter v is in the stack and it can be loaded with only one mem instruction, as in case (2).

For instance, consider the following Fortran assignment statement:

      M(i) = 3

If i and M are local automatic variables, using scheme (2) (flag loc_storage=0), the statement is translated into:

      RIS1 = [i,mem(displ(i),FP)]
      RIS2 = RIS1 * 4
      RIS3 = RIS2 + (displ(M) - 4)
      [M(i),mem(RIS3,FP)] = 3

If i is a local automatic variable (2) (flag loc_storage=0) and M is a common static variable (1), the statement is translated into:

      RIS1 = hi(loc(M))
      RIS2 = RIS1 + (lo(loc(M)) - 4)
      RIS3 = [i,mem(displ(i),FP)]
      RIS4 = RIS3 * 4
      [M(i),mem(RIS2,RIS4)] = 3

If there are only two parameters (M,i) and flag reg_parameters=1, then M is passed through a register (3.1) and i is passed through the stack (3.3). Therefore, the statement is translated into:

      RIS1 = mem(displ(i),FP)
      RIS2 = [i,mem(RIS1,RIS0)]
      RIS3 = RIS2 * 4
      RIS4 = RIS3 - 4
      [M(i),mem(loc(M),RIS4)] = 3
4 Subroutines

4.1 Passing of parameters

Parameters can be of any Fortran type. If the type of an actual parameter does not match the type of its formal parameter, the compiler coerces the actual parameter, whenever the coercion is allowed. In Fortran, parameters are always passed by reference. Hence, the arguments may have been modified when the call returns. When an actual parameter is an expression, a temporary variable has to be generated to hold the value of the expression, and its address is passed to the subroutine.

Assembler languages perform a subroutine call in two steps: passing the parameters and jumping to the subroutine. The parameters can be passed via the stack, registers or static storage. We generate code preceding the call instruction where the addresses of the actual parameters are computed. The actual parameters in the call statement are substituted in our internal representation by pairs [var, RISn], [var, TMISn] or [var, p_subrn], depending on the parameter-passing mechanism used. In all cases, RISn, TMISn and p_subrn hold the address of the variable var. For example, the Fortran source code

      CALL SUBR(X,Y+10)

assuming parameters are passed in registers (flag reg_parameters=2), is represented internally as

      CALL SUBR ([X,RIS1],[TMF1,RIS2])

Before the subroutine call, the temporary variable TMF1 is loaded with the value of the expression Y+10, and registers RIS1 and RIS2 with the addresses of variables X and TMF1, respectively.
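The code generated before the call is the usual sequence of loads and address computations. A possible sequence is sketched below, assuming X and Y are real automatic local variables and that the integer constant 10 has been converted to a static real constant c10 as described in Section 3.3.1 (the name c10 and the stack slot for TMF1 are illustrative):

      RIS1 = hi(loc(c10))                     c10 holds the real constant 10.0
      RIS2 = RIS1 + lo(loc(c10))
      RFS1 = [c10,mem(RIS2,RIS0)]
      RFS2 = [Y,mem(displ(Y),FP)]
      RFS3 = RFS2 + RFS1                      Y + 10
      [TMF1,mem(displ(TMF1),FP)] = RFS3       temporary holding the argument value
      RIS1 = displ(X) + FP                    address of X
      RIS2 = displ(TMF1) + FP                 address of TMF1
      CALL SUBR ([X,RIS1],[TMF1,RIS2])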
4.2 Return of results

Assembler languages perform a function call in three steps: passing the parameters, jumping to the function, and obtaining the result. The parameters and the call are handled as for subroutines. The result is usually returned in registers, one or more depending on the size of the returned operand.

Transforming all functions into subroutines is a design decision. The reason is that a Fortran function may return a COMPLEX value; we have translated COMPLEX variables into REAL arrays of two elements, and functions cannot return two values.
Therefore, the declarations of functions are transformed into subroutines, and the returned values are loaded into destination registers that are included at the beginning of the parameter list. For example, the following Fortran source code

      complex FUNCTION func(c,a)
      complex c
      real a
      { accesses to func }
      func = {new value}
      return
      end

is represented internally as

      SUBROUTINE func(RFS1,RFS2,c,a)
      real c(2)
      real a
      real RFS1,RFS2
      { accesses to RFS1 and RFS2 }
      RFS1 = {new value}
      RFS2 = {new value}
      return
      end

The expression that contains the function call has to be modified. The function call is converted into a subroutine call whose first parameters are the one or two registers that hold the returned value, and the function call is replaced by the use of these registers. For example, the following Fortran source code, where c1 and c2 are complex and x is real,

      c1 = func(c2,x)

is represented internally, if c1, c2 and x are in the stack, as

      REAL c1(2),c2(2)
      REAL x
      ...
      RIS1 = loc(c2)
      RIS2 = loc(x)
      CALL func(RFS1,RFS2,[c2,RIS1],[x,RIS2])
      [c1(1),mem(loc(c1(1)),RIS0)] = RFS1
      [c1(2),mem(loc(c1(2)),RIS0)] = RFS2
4.3 Library subroutines

The above transformation can be done in the call and in the declaration of a function when both are compiled by our restructurer. However, many library functions and intrinsic functions are already compiled, and their code is not available to our compiler. In this case, there would be a conflict between the invocations of these functions, which are now subroutine calls, and their declarations, which are still functions. In order to solve this problem, it is necessary to create an interface library between the calls and the functions. For example, consider the following call to an intrinsic function that belongs to the language library:

      x = SIN(y)

Assuming parameters are passed in registers and x and y are automatic local variables, it is transformed into:

      RIS1 = displ(y) + FP
      CALL SINR_IC(RFS1,[y,RIS1])
      [x,mem(displ(x),FP)] = RFS1

The interface library would contain the following definition of subroutine SINR_IC:

      SUBROUTINE SINR_IC(RFS1,t)
      REAL t,RFS1
      RFS1 = SIN(t)
      RETURN
      END
5 Control Structures

In this section we describe source-to-source transformations of basic control structures, in order to simplify the code and make it closer to the typical machine code of a RISC architecture.
5.1 IF statements

There are three kinds of IF statements in Fortran: the arithmetic IF, the logical IF and the IF-THEN-ELSE. In machine code, an IF statement is translated into an evaluation of the condition and a conditional branch with just two possibilities: continue sequentially or jump to another place. We translate the IF statements following this philosophy as closely as possible.

5.1.1 Arithmetic IF

The arithmetic IF has the following syntax:

      IF (exp) l1, l2, l3

where exp is an integer or real arithmetic expression and l1, l2, l3 are valid labels. The meaning of this statement is: if the result of the evaluation of the expression is less than, equal to, or greater than zero, control goes to l1, l2 or l3, respectively.
Our transformation consists of the following: first, the arithmetic expression is transformed when necessary as shown in Section 3; then the register that holds the result is used in IF-THEN statements. For example, consider the statement

      IF (A(I)-1.0) 100, 200, 300

Assuming that A(I)-1.0 has been stored in register RFS1, the conditional statement is represented internally as:

      RLN1 = RFS1.LT.RFS0
      IF (RLN1) THEN
        GOTO 100
      ELSE
        RLN2 = RFS1.EQ.RFS0
        IF (RLN2) THEN
          GOTO 200
        ELSE
          GOTO 300
        ENDIF
      ENDIF

5.1.2 IF-THEN-ELSE

The syntax of an IF-THEN-ELSE statement is:

      IF (logical_expr) THEN
        block1
      [ELSE
        block2]
      ENDIF

block1 is executed if logical_expr is TRUE; otherwise block2 is executed, if it exists. Back-end compilers usually translate IF-THEN-ELSE statements into the evaluation of the condition and a conditional branch. The result of the evaluated condition can be stored in the internal flags (processor status word) or in a register. We use a logical register to hold the condition result because this is more general (it would be easy to adapt to a target machine with a processor status word). For example, consider the following statement:

      IF (A.EQ.B) THEN
        {THEN_BLOCK}
      ELSE
        {ELSE_BLOCK}
      ENDIF

If A and B are stored in RFS1 and RFS2 respectively, the statement is represented internally as:

      RLN1 = RFS1.EQ.RFS2
      IF (RLN1) THEN
        {THEN_BLOCK}
      ELSE
        {ELSE_BLOCK}
      ENDIF
In ANSI F77, a compound condition need only be evaluated until its value is determined. In our representation the condition is evaluated entirely. So, for instance, for

      IF ((X.EQ.Y) .OR. (Y.EQ.Z)) THEN
        {THEN_BLOCK}
      ELSE
        {ELSE_BLOCK}
      ENDIF

assuming that X is stored in RFS1, Y in RFS2 and Z in RFS3, the statement is represented internally with the following sequence of Fortran statements:

      RLN1 = RFS1.EQ.RFS2
      RLN2 = RFS2.EQ.RFS3
      RLN3 = RLN1.OR.RLN2
      IF (RLN3) THEN
        {THEN_BLOCK}
      ELSE
        {ELSE_BLOCK}
      ENDIF

5.1.3 Logical IF

The Fortran logical IF has the following syntax:

      IF (exp) s

where exp is a logical expression and s is one and only one Fortran statement. In our representation, one original Fortran statement might become more than one statement, so we must translate the logical IF into an IF-THEN statement. For example, consider the statement IF (A.EQ.B) A = B. Assuming that A and B are automatic local variables, it is represented internally as:

      RFS1 = [A,mem(displ(A),FP)]
      RFS2 = [B,mem(displ(B),FP)]
      RLN1 = RFS1.EQ.RFS2
      IF (RLN1) THEN
        RFS3 = [B,mem(displ(B),FP)]
        [A,mem(displ(A),FP)] = RFS3
      ENDIF
5.2 DO Statements

We consider three kinds of do loops: DO-ENDDO, DO-LABEL and DO-WHILE. Although the last is not defined in ANSI F77, most compilers accept it, so we have included it in our representation.

5.2.1 DO-ENDDO loops

DO-ENDDO loops have the following syntax:

      DO index = num_expr1, num_expr2 [,num_expr3]
        {loop_body}
      ENDDO

In Fortran, the limits and the step of the loop are computed only once, before the first iteration. We also compute their expressions before the loop and store the results in registers. For example, the following loop

      DO I = J*3, K+L, N
        {body}
      ENDDO

supposing J, K, L and N are stored in RIS1, RIS2, RIS3 and RIS4, respectively, is represented internally as:

      RIS5 = RIS1 * 3
      RIS6 = RIS2 + RIS3
      DO I = RIS5, RIS6, RIS4
        {body}
      ENDDO

5.2.2 DO-LABEL loops

DO-LABEL loops have the following syntax:
      DO label index = num_expr1, num_expr2 [,num_expr3]
        {loop_body}
label {statement}

We transform this kind of DO loop into a DO-ENDDO loop because the labeled statement might become more than one statement in our representation. For example, consider the loop

      DO 10 I = ...
        ...
        GOTO 10
        ...
10    A = A + 1

The statement A=A+1 is transformed into n statements (S1..Sn) in our internal representation:

      DO 10 I = ...
        ...
        GOTO 10
        ...
10?   S1
      ...
10?   Sn

The label should be placed at the last statement of the loop, indicating the end of the loop body, but it should also be at statement S1 because of the GOTO 10 statement. We decided to transform the loop into a DO-ENDDO construct, leaving the label at the first statement S1:

      DO I = ...
        ...
        GOTO 10
        ...
10    S1
      ...
      Sn
      ENDDO
5.2.3 DO-WHILE loops

DO-WHILE loops have the following syntax:

      DO WHILE (exp)
        {body}
      ENDDO

We compute the expression exp before the loop and substitute for it, in the DO-WHILE statement, the logical register that holds its value. Moreover, it is necessary to include that computation at the end of the loop body in order to update the value of the logical register. For example, for

      DO WHILE (I .LT. 10)
        I = I + 1
      ENDDO

assuming that I is an automatic local variable, the loop is represented internally as

      RIS1 = [I,mem(displ(I),FP)]
      RLS1 = RIS1 .LT. 10
      DO WHILE (RLS1)
        RIS2 = [I,mem(displ(I),FP)]
        RIS3 = RIS2 + 1
        [I,mem(displ(I),FP)] = RIS3
        RIS4 = [I,mem(displ(I),FP)]
        RLS1 = RIS4 .LT. 10
      ENDDO
5.3 Computed and assigned GOTOs

Both computed and assigned GOTOs can transfer control to more than two destination points. Because our objective is to produce Fortran code as close as possible to machine code, we transform these statements into IF-THEN statements, which can transfer control to only two destination points.

5.3.1 Computed GOTO

This statement has the following syntax:

      GOTO (label_list), i

where i is a scalar integer expression. The execution of a computed GOTO statement causes the evaluation of the scalar integer expression i. If 1≤i≤n, where n is the number of labels in label_list, control is transferred to the ith label in the list. If i is less than 1 or greater than n, no branch is taken. We represent it with a list of IF-THEN statements. For example, consider the statement

      GOTO (10,20,30), i

Supposing i is in RIS1, the previous statement is represented internally as:

      RLN1 = RIS1.EQ.1
      IF (RLN1) THEN
        GOTO 10
      ENDIF
      RLN2 = RIS1.EQ.2
      IF (RLN2) THEN
        GOTO 20
      ENDIF
      RLN3 = RIS1.EQ.3
      IF (RLN3) THEN
        GOTO 30
      ENDIF

5.3.2 Assigned GOTO

This statement has the following syntax:

      ASSIGN label TO scalar_int_var
      ...
      GOTO scalar_int_var [, label_list]

The execution of an ASSIGN statement assigns a label to an integer variable. This label can be referenced only in an assigned GOTO or in I/O statements. We must preserve the ASSIGN statement in case it is used in an I/O statement, but we add a new statement to handle the attached assigned GOTO statement. The execution of an assigned GOTO statement may cause the control flow to be transferred to one of multiple target statements identified by the label. If the actual value of the label is not in the label_list, no branch is taken.

In our internal representation the assigned GOTO statement is transformed into a computed GOTO statement as explained below, and then the transformation of Section 5.3.1 is applied. This scheme solves the problem of multiple control transfers in a single statement. The transformation of an assigned GOTO statement into a computed GOTO is done in the following manner: a shadow integer variable is generated for every variable loaded in an ASSIGN statement. This variable may take different values depending on the set of values of scalar_int_var. After each ASSIGN statement the corresponding value is loaded into that shadow integer variable. In the generated computed GOTO statement the value of this variable is tested. The compiler can analyze which ASSIGN statements reach the assigned GOTO and generate only the necessary IF-THEN statements. For example, the following statements

      ASSIGN 20 TO I
      ...
      GOTO I, (10,20,30)

need the new variable IAGOTO, which will be stored in the stack, and are represented internally as:

      ASSIGN 20 TO I
      [IAGOTO,mem(displ(IAGOTO),FP)] = 2
      ...
      RIS1 = [IAGOTO,mem(displ(IAGOTO),FP)]
      RLN1 = RIS1.EQ.1
      IF (RLN1) THEN
        GOTO 10
      ENDIF
      RLN2 = RIS1.EQ.2
      IF (RLN2) THEN
        GOTO 20
      ENDIF
      RLN3 = RIS1.EQ.3
      IF (RLN3) THEN
        GOTO 30
      ENDIF
6 Code Generation Strategies

The main reason the project described here was initiated was to support ongoing research at the Universitat Politècnica de Catalunya on the automatic detection of fine-grain parallelism and on the analysis and evaluation of architectural features to exploit this type of parallelism. We were faced with two options. One was to upgrade an existing compiler, such as the GNU C compiler [Stal94], with powerful analysis and transformation routines. The second option was to extend a source-to-source translator in the form described in the preceding sections. We chose the second option because it seemed easier and more general.

This approach is easier because extending one of the many source-to-source translators available today involves only the addition of some translation steps, such as the transformation of assignment statements into triplets, the generation of the instructions needed to compute addresses, etc. In fact, we have implemented the translation to low-level form in both Polaris [BEKG94] and Parafrase-2 [PGHL89] at the cost of only six man-months.
It is also more general because most of the optimizations needed to generate efficient code, such as common subexpression elimination, are already available in many source-to-source translators. Some additional optimizations, such as register allocation, have to be implemented because they are usually not available in source-to-source restructurers, but they are only a fraction of the total number of optimizations needed, and in any case they would have to be done for each new architecture, because these optimizations are usually part of a complex heuristic to generate code for fine-grain parallelism.

In this section we present a brief outline of the steps we believe are necessary to generate compilers and other tools that are useful for the research at UPC. The idea is to start with the source-to-source restructurer and represent Fortran programs in the way described above. There are at least two classes of tools that would be useful to study fine-grain parallelism. One would be used to study machine organization using simulators. For this class of tools, it is not necessary to generate machine code because the simulators can operate directly on the low-level Fortran code. The second class of tools would generate machine code and can be used to study the effect of transformations on real machines whose accurate simulation would be too expensive and therefore unfeasible.

For both classes of tools it is necessary to analyze and transform the code in the best possible way. Some of these analyses and transformations are useful regardless of the target machine and will have to be implemented unless they are already part of the source-to-source translator:

1. Constant/value propagation
2. Common subexpression elimination
3. Strength reduction (see the sketch following this list)
4. Dead code elimination
5. Invariant removal
6. Privatization
7. Induction variable recognition
8. Recurrence (including reduction) recognition
9. Dependence analysis
10. Interprocedural analysis (interprocedural versions of the above analysis and transformation passes). Inlining. Cloning.
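As an illustration of item 3, strength reduction applied to the low-level form removes the multiplication in the address computation of a sequentially referenced array. The sketch below assumes a and I are automatic local variables (scheme (2) of Table 1) and that a has unit lower bound and 4-byte elements; the loop simply stores 0.0 into every element of a:

      DO I = 1, N
        RIS1 = [I,mem(displ(I),FP)]
        RIS2 = RIS1 * 4
        RIS3 = RIS2 + (displ(a) - 4)
        [a(I),mem(RIS3,FP)] = RFS0
      ENDDO

After strength reduction, the multiplication is replaced by an addition on an induction register, and the load of I in the body becomes dead:

      RIS3 = displ(a)
      DO I = 1, N
        [a(I),mem(RIS3,FP)] = RFS0
        RIS3 = RIS3 + 4
      ENDDO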
Some of these passes should be done on the low-level representation (e.g. strength reduction), while others can also be done on the original program. Deciding at what level to apply a transformation or analysis phase, and its effect on the performance and accuracy of the compiler, is still an open problem.

After the machine-independent transformations have been applied, we can proceed to apply the machine-dependent transformations. These include register allocation and code reordering. Usually, these transformations interact with each other and are based on heuristics that use information on dependences, private variables, induction variables, reductions, etc. A number of heuristics have been developed for machine-dependent transformations. However, there is still much room for evaluation and perhaps for improvement. Also, new architectural features will present new challenges.

As mentioned above, the output of the machine-dependent phase can be used in a number of research studies. However, for those projects requiring executable code, it is necessary to develop a last phase to generate assembly or object code. This phase has not been developed at the time of writing. However, it seems clear it should be straightforward and require only one pass through the low-level representation. In fact, it seems that all the steps can be represented as simple rules that only require information from each statement separately.
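To give an idea of what such a rule could look like, the statements of case (1) in Table 1 might map, on a MIPS-like target, roughly as follows (the mnemonics and register names are only indicative; they are not produced by the current system):

      RIS1 = hi(loc(v))            -->   lui   r1, %hi(v)
      RIS2 = RIS1 + lo(loc(v))     -->   addiu r2, r1, %lo(v)
      RFS1 = [v,mem(RIS2,RIS0)]    -->   lwc1  f1, 0(r2)

Each low-level statement maps to one machine instruction, and the information needed (opcode, operand registers, relocation of the address of v) is local to the statement.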
7 Future Research Projects

As mentioned in the previous section, a number of research projects can be supported with a source-to-source restructurer extended as described in this report. In this section we describe some research topics that will be supported in the near future by our source-to-source restructurer. In all cases, the underlying consideration is that a compiler is necessary to experiment with real applications in the evaluation of proposed designs and compiler algorithms.
7.1 Architectural Features of Superscalar Machines

There are many issues involved in the design of superscalar machines, including the number and type of functional units and the organization of the register files. This includes not only an evaluation of architectural features, but also the evaluation of algorithms to perform low-level scheduling, register allocation and spill code generation. For instance, [LVAL94] studies the register requirements of software pipelined loops for different superscalar and superpipelined configurations. The impact on performance of having a finite number of registers in the register file is also evaluated in terms of increase in memory traffic and slowdown.

Two different register organizations are proposed in [LVFA94] and [LlVA95]. In [LVFA94], a new organization consisting of a small high-bandwidth multiported register file and a low-bandwidth port-limited register file (called a sack) is presented. The sack has a single read/write port and is therefore cheaper (in area) and faster than the multiported file, so it can contain a large number of physical registers. An algorithm to assign values to the sack is proposed. In [LlVA95] a non-consistent dual register file is presented as a new register organization to reduce register pressure. This organization is inspired by the implementation of the register files of some new processors such as the Power2 [WhDh93]; the register file is implemented as two register subfiles with the same number of registers and the same number of write ports, but half the number of read ports into each register subfile. The two subfiles are consistent in the sense that both store exactly the same value in the same registers. This implementation reduces the complexity of the register file. In [LlVA95] the authors propose and evaluate an organization where each subfile can be accessed independently of the other and store different values; this gives the freedom of storing some values in the two subfiles in a consistent way, or storing some values in just one of the two subfiles. The authors show that the organization proposed is cheaper than doubling the number of registers, does not penalize the access time to the register file, and in most cases is as effective as doubling the number of registers.

Topics of future research include the evaluation of other register file organizations with several sacks and algorithms to assign values to them. The optimization of spill code is another area that needs further research effort. Having the compiler whose development is proposed in this report would make it possible to evaluate these architectural features, and the algorithms to exploit them, on real programs (benchmarks from the Perfect Club [Poin89], SPEC [Dixi91], and proprietary applications). Otherwise, the evaluation is restricted to small benchmarks and loops such as the Livermore Loops [McMa72].
7.2 Vector Processors and Decoupled Architectures

Another project deals with the study of vector processors and decoupled architectures. The first part of the project focuses on the evaluation of the Convex architecture, including a number of measurements of the characteristics of vector instructions and their frequency [Espa94]. So far it has been necessary to work with the assembly code generated directly by the Convex compiler. This limits the type of analysis that can be done and complicates the instrumentation of the code for dynamic measurements.

The availability of a complete restructurer for the Convex should help to better understand the relationship between vectorization and functional parallelism and should facilitate the measurement of the characteristics of the object code. The restructurer would provide two main facilities to help in the analysis of vector architectures. First, the restructurer has an abstract representation of the program being analyzed, which provides an easy mechanism to locate basic block boundaries and to instrument the program with instructions that collect different statistics about the running program. Second, the restructurer can provide the data dependence graph of the whole program; thus, by extending the internal representation with vector operations, it will be possible to modify the vector code generated by the compiler to adapt it to new, theoretical vector architectures.

With the statistical data gathered during the first phase of this project, the weak points of vector architectures will be more easily identified, and the second phase of the project will focus on possible solutions to these weaknesses. The restructurer is a tool of capital importance for evaluating the performance of the new solutions proposed and for adapting the generated code to these new architectures. Among the different new vector architecture designs that will be explored, many will need new compiler technology, such as decoupling, multi-level functional units, new vector register bank organizations, etc. Any modification of the architecture that includes one of the above-mentioned topics will immediately require the compiler to be aware of the new features of the machine. Thus, the availability of the restructurer will give us the opportunity to modify the code generation algorithms of the compilers to adapt them to the new schemes being studied. The study of alternative designs and compiler strategies for decoupled architectures should also profit from the strategy presented here.
7.3 Low-level Instruction Scheduling

Low-level parallelism and instruction scheduling for superscalar and VLIW architectures is another area of research that would benefit from this compiler. For instance, the evaluation of program characteristics such as inherent parallelism [BoLB93] and the implementation of scheduling algorithms [BaLB94, BaLA94] that try to attain the available parallelism and improve memory locality would be possible in more detail. The impact of architectural restrictions (such as the number of functional units, the operations they perform, the number of registers and the ports to memory) on the parallelism obtained by the scheduling algorithms on real applications is an example of research work that is planned for the near future. A number of studies on compiler strategies that take into account both high- and low-level information are possible, including the study of trade-offs between high- and low-level parallelism and the design of strategies to enhance locality not only at the main memory and cache levels, but also at the register level.
8 References

[ABCC88] F. Allen, M. Burke, P. Charles, R. Cytron and J. Ferrante. “An Overview of the PTRAN Analysis System for Multiprocessing”. Journal of Parallel and Distributed Computing, Vol. 5, 1988.

[BEKG94] B. Blume, R. Eigenmann, K. Faigin, J. Grout, J. Hoeflinger, D. Padua, P. Petersen, B. Pottenger, L. Rauchwerger, P. Tu and S. Weatherford. “Polaris: The Next Generation in Parallelizing Compilers”. Proceedings of the Seventh Workshop on Languages and Compilers for Parallel Computing, 1994.

[BENP93] U. Banerjee, R. Eigenmann, A. Nicolau and D. Padua. “Automatic Program Parallelization”. Proceedings of the IEEE, 81(2), February 1993.

[BaLA94] C. Barrado, J. Labarta and E. Ayguadé. “An Efficient Scheduling for Doacross Loops”. Proceedings of the ISMM Parallel and Distributed Computing and Systems, 1994.

[BaLB94] C. Barrado, J. Labarta and P. Borensztejn. “Implementation of GTS”. Proceedings of the Int. Conf. on Parallel Architectures and Languages Europe, 1994.

[BoLB93] P. Borensztejn, J. Labarta and C. Barrado. “Measures of Parallelism at Compile Time”. Proceedings of the 1st EUROMICRO Workshop on Parallel and Distributed Processing, 1993.

[Dixi91] K. Dixit. “The SPEC Benchmarks”. Parallel Computing, No. 17, 1991.

[Espa94] R. Espasa et al. “Quantitative Analysis of Vector Code”. Proceedings of the 3rd EUROMICRO Workshop on Parallel and Distributed Processing, 1995.

[LlVA95] J. Llosa, M. Valero and E. Ayguadé. “Non-consistent Dual Register Files to Reduce Register Pressure”. Proceedings of the 1st Int. Symposium on High Performance Computer Architecture, 1995.

[LVAL94] J. Llosa, M. Valero, E. Ayguadé and J. Labarta. “Register Requirements of Pipelined Loops and its Effects on Performance”. Proceedings of the 2nd Int. Workshop on Massive Parallelism, 1994.

[LVFA94] J. Llosa, M. Valero, J. Fortes and E. Ayguadé. “Using Sacks to Organize Registers in VLIW Machines”. Proceedings of the CONPAR94-VAPP-VI Conference, 1994.

[McMa72] F. McMahon. “Fortran CPU Performance Analysis”. Lawrence Livermore Laboratories, 1972.

[PGHL89] C. Polychronopoulos, M. Girkar, M. Haghighat, C. Lee and B. Leung. “Parafrase-2: An Environment for Parallelizing, Partitioning, Synchronizing, and Scheduling Programs on Multiprocessors”. Proceedings of the Int. Conf. on Parallel Processing, Vol. II, 1989.

[Poin89] L. Pointer. “Perfect Report: 1”. CSRD Report No. 896, University of Illinois, 1989.

[ScKo86] R. G. Scarborough and H. G. Kolsky. “A Vectorizing Fortran Compiler”. IBM Journal of Research and Development, Vol. 30, No. 2, pp. 163-171, March 1986.

[Stal94] R. M. Stallman. “Using and Porting GNU CC”. Free Software Foundation, 1994.

[TWLP91] S. Tjiang, M. Wolf, M. Lam, K. Pieper and J. Hennessy. “Integrating Scalar Optimization and Parallelization”. 4th Workshop on Languages and Compilers for Parallel Computing, 1991.

[WhDh93] S. White and S. Dhawan. “POWER2: Next Generation of the RISC System/6000 Family”. IBM RISC System/6000 Technology: Volume II, IBM Corporation, 1993.

[Wolf82] M. Wolfe. “Optimizing Supercompilers for Supercomputers”. PhD Thesis, University of Illinois, Department of Computer Science, 1982.

[Zima91] H. P. Zima. “Supercompilers for Parallel and Vector Computers”. ACM Press, New York, NY, 1991.