Ictíneo: a Tool for Instruction Level Parallelism Research(1)

E. Ayguadé, C. Barrado, A. González, J. Labarta, J. Llosa, D. López, S. Moreno, D. Padua(2), F. Reig, Q. Riera, and M. Valero

Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona

November 7, 1996
Abstract

Current Fortran optimizing compilers often include source-to-source transformations for the automatic parallelization or vectorization of loops. Lower-level optimizations, such as those that aim to exploit ILP, are performed at later stages on the assembly-language level and do not profit from information available at the source level, such as array subscripts for data dependence analysis. Low-level optimizations could generate better code if this high-level information were available. In this paper we describe a framework in which low-level code, close to machine language, retains high-level constructs such as loops and source-level variable names. This allows low-level and high-level optimizations to be performed in the same framework, leading to better optimized code. Our low-level representation is a single-assignment virtual assembly language. It can optionally be refined to an even lower-level form in which address calculations for memory accesses are made explicit. We have implemented several low-level optimizations aimed at reducing the number of memory accesses and instructions executed. Finally, we present the research projects that have benefited from Ictíneo.
1 Introduction

Compiler techniques for the automatic detection of parallelism are often described in the literature and implemented as source-to-source transformations [Wolf82, Zima91, BENP93]. These implementations are usually experimental translators for the parallelization and vectorization of loops [ABCC88, PGHL89, BEKG94]. Low-level operations, such as those needed to compute the addresses of arrays and scalar parameters, are usually hidden from these source-to-source translators, which internally represent only high-level constructs.

Most compilers for high-performance machines include two separate translators: a source-to-source translator for the automatic detection of parallelism and a back-end compiler that generates machine instructions. This division of labor usually implies duplication of effort and makes it difficult for these translators to interact.

(1) This work has been supported by the Ministry of Education of Spain under contract TIC429/95, and by CEPBA (European Centre for Parallelism of Barcelona).
(2) CSRD, University of Illinois, Urbana-Champaign, IL, USA.

Some analysis and
transformation algorithms may be needed by both translators. One reason is that the internal representation is normally different in each translator, and therefore much of the code for the analysis and transformation algorithms has to be different. Also, the two translators are usually written by different groups of people, possibly at different times, and little effort is made to share the analysis and transformation routines. Data dependence analysis is an example of an algorithm that may be needed by both translators: the source-to-source translator needs dependence analysis to detect loop-level parallelism, and the back-end needs it to detect instruction-level parallelism. Another example is induction variable recognition and removal, which is useful in the source-to-source parallelizer to increase parallelism and facilitate dependence analysis. In the back-end compiler, this transformation has to be applied again to identify and remove the redundant induction variables produced by strength reduction.

The need for interaction arises when one translator needs information that is naturally available to the other. We next discuss three classes of interaction. In the first class, the information computed by the source-to-source translator is needed by the back-end compiler. All that is needed in this case is to pass the information together with the transformed program. The interaction becomes more difficult, but still feasible, in the second class, where information from the back-end compiler is needed by the source-to-source parallelizer. For example, the exact machine code translation of the loop body is useful to determine the execution time of the loop body, which the parallelizer has to estimate when deciding the degree of blocking or unrolling needed to minimize the effect of the scheduling overhead. One simple way to obtain this information, albeit not very efficient, is to invoke the back-end compiler to determine the exact form of the object code and then feed this information back to the parallelizer. The division of labor becomes an insurmountable obstacle in the third class of interaction, where two transformations that are performed separately by each translator need to be consolidated into a single transformation. For example, when translating a loop into parallel form, it seems better to apply a single transformation that takes into account both loop-level and instruction-level parallelism.

In a previous paper [ABLL95], we described a set of source-to-source transformations that made it possible to represent low-level operations as part of the internal representation of a conventional source-to-source Fortran translator. The strategy was to represent the low-level operations as Fortran statements. In this way, all the transformation and analysis routines available in the source-to-source restructurer can be applied to the low-level representation of the program. The source-to-source parallelizer can then be extended to include many traditional analysis and transformation steps, such as strength reduction and register allocation, not usually performed by this translator. The generation of machine instructions is done as a last step by a direct mapping from each Fortran statement onto one or more machine instructions. The source-to-source restructurer is therefore extended into a complete compiler, as shown in Figure 1.
All transformations, including high-level parallelization and the traditional scalar optimizations, can now be performed in a unified framework based on a single internal representation. One additional advantage of representing the low-level operations as Fortran statements is that the outcome of each
transformation, both high- and low-level, is a Fortran program that can be executed to test the correctness of the transformation.

Another approach that also uses a uniform representation for both high-level parallelization and scalar optimizations was followed in the IBM Fortran compiler [ScKo86]. The main difference with our approach is that this compiler evolved from a traditional back-end compiler, which was extended to do some of the high-level transformations usually performed in other systems by a source-to-source translator. The Stanford University Intermediate Form (SUIF) [TWLP91] also has a goal similar to that of the representation described here. Again, the difference is that SUIF extends low-level operations with annotations, and many of the transformations have been implemented by extending low-level passes so that they can recognize the high-level annotations. In our case we modified a source-to-source translator so that it can also deal with low-level operations.

The rest of this report describes the current state of the project. We begin, in Section 2, with an overview of the features of the Ictíneo transformer. In Section 3, we introduce the splitting of Fortran statements into a low-level intermediate form, and we describe this intermediate form in Section 4. Section 5 presents a refinement of this intermediate form that includes address calculations. In Section 6 we discuss some code generation strategies, and in Section 7 we present a number of research projects which have profited from the Ictíneo restructurer.
[Figure 1: Block diagram of Ictíneo. Fortran-77 source code enters the POLARIS parser; the splitting of high-level statements produces the internal representation, on which optimizations, dependence analysis, and register allocation & instruction scheduling operate; instrumentation and additional information accompany the Fortran-77 output.]
2 The Ictíneo toolset

Ictíneo is a tool for research in computer architecture. As such, the tool is parameterized and can be tailored to most current architectures. Ictíneo makes it possible to analyze a Fortran code, to optimize it, and to generate object code for a specific machine. A distinguishing characteristic of Ictíneo is that the original and any transformed code (even the object code) is always represented by legal Fortran statements. This feature allows the reuse of many analysis and optimization routines at different levels. Figure 1 shows a block diagram with the main parts of Ictíneo.

Ictíneo is a toolset that offers the user a set of analysis and optimization modules that can be applied to a Fortran 77 source code. The first step is the parsing of the original program, which is performed using the Polaris parser. The splitting module transforms the program into an equivalent Fortran 77 program in which only a subset of Fortran statements is allowed. These statements are basically three-operand instructions (two sources and one destination), data movement instructions, and a subset of the control-flow statements. The register allocation and instruction scheduling module can be applied to the split code in order to generate the object code for a specific processor architecture. At any step a dependence analysis can be performed, static information can be generated, and the code can be instrumented to obtain run-time information. This instrumentation can be used either to obtain dynamic statistics or to generate a trace that will feed a simulator of a given architecture.
3 Splitting high-level statements

Ictíneo transforms Fortran programs into an intermediate form that contains both high-level and low-level constructs. This transformation is done in two steps, which are, to a certain extent, independent of each other. The second step is a refinement of the first and, in between them, other transformations may be performed if desired. These two intermediate forms contain only legal Fortran statements, so, at any time, they can be compiled and executed to test the validity of the transformations. We call the first form triplets code, and its syntax can be informally described by listing what Fortran statements are allowed:

• A subset of Fortran non-executable statements. These contain information such as the characteristics of operands, types of data, and format specifications for input/output. For example: PROGRAM p, EXTERNAL subr, REAL*4 var1, PARAMETER pi=3.14, DATA sqrt2 /1.41/, 10 FORMAT(I4), etc.
• Restricted assignment statements of the form dest=src.
• Restricted arithmetic and logical expressions of at most two operands, like dest=src1+src2 and dest=src1.OR.src2.
• A subset of Fortran control flow statements: do loops, while loops, if-then-else statements, and unconditional gotos.
• Subroutine calls, like CALL subr(param1, ..., paramN).
• All Fortran input/output statements: READ, WRITE, PRINT, etc.

Entry statements and function definitions are two non-executable statements that are not supported. They are both transformed to subroutines.

We designate assignment statements as restricted because a Fortran assignment like y=x is performed as a "load" of variable x into a "register", followed by a "store" from the "register" into variable y. These "registers" are not the physical registers of the machine, but virtual registers that are represented by means of Fortran scalar variables. When we "load" a value into a virtual register, we actually perform a Fortran assignment. We designate arithmetic and logical expressions as restricted because they are limited to two operands, because operands cannot be of COMPLEX type, and because operands must be registers: memory-to-memory operations are not allowed. The transformed Fortran program will contain only four control structures: do loops, while loops, if-then-else statements, and unconditional gotos. Arithmetic ifs are transformed into if-then-else blocks, and assigned gotos and computed gotos are transformed into a sequence of if statements. Subroutine calls may have any number of parameters, but if they include any of the "forbidden" constructs, like compound arithmetic expressions, then these expressions are first split into low-level statements. The same applies to the arguments of I/O statements.

Fortran statements are translated to low-level form using an infinite number of virtual registers, which is the approach used by many compilers. We use the single-assignment scheme, where registers are stored into only once. When the program is mapped to a particular architecture, virtual-to-physical register mapping can be performed. There are three classes of virtual registers, one for each of the following Fortran basic types: integer, real, and logical; there are no registers for complex and character values. Memory addresses are integer values, so we use integer registers to store addresses. Virtual registers are represented by means of Fortran scalar variables. Their names begin with an "R", followed by two letters that identify their type and size, and by a number:
R TS nn: "R" for Register, T for the type (Integer, Float, or Logical), S for the size (Single or Double), and nn for the register number. For example, ris1 is an integer single-precision register and rfd2 is a floating-point double-precision register.

We represent complex numbers as arrays of two scalars. Complex-number operations are translated into a sequence of scalar operations on the real and the imaginary parts. They include addition, subtraction, multiplication, division, and several Fortran intrinsic functions that operate on complex values. The following code shows how complex variables are represented. The original program declaration:
COMPLEX*8 c

is transformed to:

REAL*4 c
DIMENSION c(1:2)

In Fortran numeric applications there is hardly ever any extensive string processing; string use is normally limited to output statements. Ictíneo assigns static memory to strings with Fortran's DATA statements, and no special transformation is carried out upon them. For example, the statement:

WRITE (*,*) 'SPEC benchmark 103.su2cor'

becomes:

DATA c1 /'SPEC benchmark 103.su2cor'/
WRITE (*,*) c1

So far we have presented an overview of the syntax of the triplets intermediate form. The second intermediate form is a refinement of the triplets code, where some elements remain identical and others are modified. In particular, we alter load/store statements and subroutine calls. In the second intermediate form we explicitly calculate the addresses of program objects. Addresses, rather than program variables, are the source and target of loads and stores. For this reason, we call this representation addresses code. Loads like rfs1=x and stores like y=rfs1 in the triplets notation are performed in the addresses notation through subroutine calls. A load becomes:

call load_mem(rfs1, address_of( x ))

and a store becomes:

call store_mem( address_of( y ), rfs1)

The actual syntax of load_mem and store_mem differs a little from the syntax presented here; it is given in the section devoted to the addresses code. Subroutine calls are modified so that their parameters are not variable names that refer to program objects, but the addresses of these program objects. Section 4 explains the triplets intermediate form in more detail, and in Section 5 we describe the extra transformations that, applied to the triplets code, generate the addresses code.
4 The triplets notation

The triplets intermediate form contains Fortran statements with a restricted syntax. This syntax is further described in this section. Subsection 4.1 explains the splitting of
assignment statements into loads and stores. In Subsection 4.2 we explain how compound arithmetic and logical expressions are split into simpler ones with at most two arguments. Subsection 4.3 is devoted to type coercions in arithmetic expressions and Subsection 4.4 to subroutines and functions. Subsection 4.5 deals with I/O statements. Finally, Subsection 4.6 explains how control structures, such as if statements and loops, are transformed to simpler structures.
4.1 Assignment statements

A Fortran assignment like y=x is split into a load followed by a store. We load from program variables into virtual registers and store from virtual registers into program variables. The type of the program variable and of the virtual register must be the same. We perform loads and stores of logical, integer, and floating point values. Complexes are split into two values and character strings remain unmodified. For example, the assignment y=x is translated into:

rfs1 = x        (load x into virtual register rfs1)
y = rfs1        (store virtual register rfs1 into y)
4.2 Arithmetic and logical expressions

Syntactically, Fortran expressions can be binary or n-ary. For example, relational expressions like a.GT.b or divisions like a/b are binary, while additions like a+b+c and logical expressions like p.AND.q.AND.r are n-ary. Ictíneo transforms n-ary expressions into a sequence of statements that contain only binary expressions. This splitting into several statements preserves the order of evaluation of the original expression.

In the case of arithmetic expressions, Ictíneo transforms them in the following way: first, expressions that contain several operations are split into a series of simple statements comprising a single operation each. Moreover, the operands of these operations must be virtual registers, rather than program variables, so values must be loaded into virtual registers before being operated upon. Complex values cannot be loaded into a single register, so we partition complexes into their real and imaginary parts. Operations on complexes are translated into a sequence of scalar operations using these two parts.

The syntax of the arithmetic operators "+", "-", "*", and "/" is left unchanged. The exponentiation operator "**" is not likely to be implemented in a modern architecture as a machine instruction, so we transform it to a subroutine call, reflecting the idea that it is performed in software. If a different architecture is to be simulated, a post-pass can particularize the code to reflect specific architectural features. The following statement contains a compound arithmetic expression:

y = a + b * c**i
Below we show the transformed triplets code. In it, the call to powri_ic performs exponentiation (rfs4 ← rfs3**ris1).

rfs1 = a
rfs2 = b
rfs3 = c
ris1 = i
CALL powri_ic(rfs4, rfs3, ris1)
rfs5 = rfs2*rfs4
rfs6 = rfs1+rfs5
y = rfs6

Compound logical expressions are decomposed into binary operations in the same fashion as arithmetic expressions. A simple example shows this:

(p .AND. q) .OR. r

For this expression, Ictíneo generates the following sequence of Fortran statements, where we assume that p, q and r are logical values and have been previously loaded into virtual registers rls1, rls2 and rls3, respectively:

rls4 = rls1 .AND. rls2
rls5 = rls4 .OR. rls3
4.3 Type coercions in arithmetic expressions

Arithmetic operators in Fortran are overloaded. In a mixed-type expression, Fortran compilers perform any necessary type coercions. Ictíneo makes them visible, inserting subroutine calls. For example:

INTEGER k
REAL x,y
x = 2*y*k

is transformed into:

DATA c1 /2.0/
rfs1 = c1
rfs2 = y
rfs3 = rfs1*rfs2
ris1 = k
CALL reali_ic(rfs4, ris1)      (integer-to-real coercion)
rfs5 = rfs3*rfs4
x = rfs5
Notice that the integer variable k is converted to real before performing the multiplication, but the integer constant 2 is statically changed to 2.0, without calling reali_ic.
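The converse coercion arises when a real expression is assigned to an integer variable. This case is not shown in the report; the sketch below assumes that the intr_ic routine listed in Appendix A performs the real-to-integer truncation, and the register numbering is illustrative:

INTEGER k
REAL x
k = x*2.0

would presumably become:

DATA c1 /2.0/
rfs1 = x
rfs2 = c1
rfs3 = rfs1*rfs2
CALL intr_ic(ris1, rfs3)      (real-to-integer truncation, assumed)
k = ris1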
4.4 Subroutines and functions

In Fortran, when a subroutine or function call is executed, any expression in the actual argument list is evaluated before control is passed to the subroutine. Ictíneo inserts before the call a sequence of statements in triplets form that evaluate the expressions in the
argument list. We apply one further transformation to function calls: they are transformed into subroutine calls. We describe parameter passing in Subsection 4.4.1. The transformation of functions into subroutines is illustrated in Subsection 4.4.2. Subsection 4.4.3 explains how Ictíneo deals with library functions.

4.4.1 Parameter passing

Ictíneo inserts code before subroutine and function calls that evaluates any expressions that the argument list might contain. In Fortran, parameters are passed by reference, so the result of the evaluation is first stored into a temporary variable, and this variable is placed in the argument list in the position of the expression. When an argument is not an expression, but a variable name (simple or subscripted), an array name or a subprogram name, Ictíneo does not modify it. When an argument is a compound expression, a variable of type complex, or a numeric or string constant, Ictíneo transforms it as follows: compound expressions (arithmetic, relational, logical and character) are split into a series of low-level statements that are inserted before the call. Complexes are transformed to an array of two components; we insert this array as a parameter in lieu of the complex. Constants are assigned static memory with Fortran's DATA statement, and the DATA name replaces the constant. If the call is a function call, it is transformed into a subroutine call. For example, the following call statement:

CALL subr(x,y+10)

is transformed into:

DATA c1 /10.0/
rfs1 = y
rfs2 = c1
rfs3 = rfs1+rfs2
tmf1 = rfs3
CALL subr(x, tmf1)
Here, the variable x has a memory address, so we use it as a parameter, but the expression y+10 does not have an address, so we compute it and store it into the temporary variable tmf1, which has a memory address.

4.4.2 Functions

Ictíneo transforms function definitions to subroutine definitions. The reason for this is that, once we decided to represent complexes as arrays of two scalars, we faced the problem of Fortran functions that return complexes. Having a pair of scalars rather than a complex was better for our purposes, but, in Fortran, functions cannot return array values, so we decided to transform complex function definitions to subroutine definitions. For homogeneity, we transform all functions. Correspondingly, function calls are transformed to subroutine calls. So, what happens to function results? Return values are loaded into return registers that are inserted at the beginning of the parameter list. Only one register is
needed to hold a scalar result, and two are needed for complex results. For example, the following function definition:

COMPLEX FUNCTION func(c,a)
COMPLEX c
REAL a
{ accesses to c and a }
func = {new value}
RETURN
END

is transformed into this subroutine definition:

SUBROUTINE func(rfs1,rfs2,c,a)
REAL c(2)
REAL a
REAL rfs1,rfs2
{ accesses to c and a }
rfs1 = {new value}
rfs2 = {new value}
RETURN
END

In a similar way, the following function call:

COMPLEX c1,c2,func
REAL x
c1 = func(c2,x)

is transformed into a subroutine call statement:

REAL c1(2),c2(2)
REAL x
CALL func(rfs1, rfs2, c2, x)
c1(1) = rfs1
c1(2) = rfs2

The only function calls that cannot be transformed are those that are part of an implied DO loop inside an I/O statement, like WRITE(*,*) ( func(i), i=1,n ). We cannot transform the implied DO loop to an explicit DO in which we perform a call to WRITE in each iteration because it does not produce the same output: it inserts carriage returns between values. Therefore, function calls inside implied DO loops that are within I/O statements are not transformed to subroutine calls.

4.4.3 Interface to library subroutines

The above transformation can only be applied to function calls and function definitions of the program we are transforming. However, there are plenty of useful functions available in the standard Fortran library that programmers link to their code. They are usually precompiled, and their source code is not generally available for being transformed by Ictíneo. In order to avoid this problem, we provide a library of subroutine definitions that perform the desired function calls to the standard Fortran library. This library must be
linked with Ictíneo's output in order to execute it. For example, the following subroutine definition from our interface library:

SUBROUTINE sinr_ic(y,x)
y = SIN(x)
RETURN
END

allows us to safely transform a function call like this:

y = SIN(x)

into a subroutine call:

CALL sinr_ic(rfs1, x)
y = rfs1

Intrinsic function names are overloaded, allowing us to call them with arguments of different types. Multiple definitions are provided for the same function with arguments and results of different types. We have, for example, sinr_ic for parameters of type real and sind_ic for double precision parameters. A special case of Fortran intrinsic functions are min and max, which accept a variable number of arguments. When they are called with more than two arguments, we split them into a series of calls, each one having two arguments (see the sketch at the end of this subsection).

Several of the Fortran intrinsic functions are machine instructions in some architectures (like sin in the 80x86 family), while the rest are implemented as subroutine calls. The list of intrinsic functions that belong to each group is user-definable. Intrinsic functions implemented by machine language instructions operate on virtual registers, so the values of program variables are loaded into registers before the call is executed. Let us see how we transform the call y=sin(x) in the case where it is implemented as a machine instruction:

rfs1 = x
call sinr_ic(rfs2, rfs1)
y = rfs2

For intrinsic functions that are implemented as software subroutine calls, we need to follow Fortran pass-by-reference conventions. We do not perform a load into a register, but pass a parameter that has a memory address. For y=sin(x), when it is implemented in a software library, we use the program variable and there is no load before the call:

CALL sinr_ic(rfs1, x)
y = rfs1

Appendix A contains a full list of the interface subroutines.
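To make the splitting of variadic intrinsics concrete, the following is a sketch of how a three-argument MAX would presumably be decomposed, assuming max is treated as a machine instruction and using the maxr_ic routine listed in Appendix A (the register numbering is illustrative):

y = MAX(a, b, c)

would become:

rfs1 = a
rfs2 = b
rfs3 = c
CALL maxr_ic(rfs4, rfs1, rfs2)
CALL maxr_ic(rfs5, rfs4, rfs3)      (MAX(a,b,c) = MAX(MAX(a,b),c))
y = rfs5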
4.4.4 Entry statements

Entries are a Fortran artifact that provides an entry point into the middle of a subroutine body. Ictíneo transforms entries into full autonomous subroutines, which are a cleaner programming concept. This also allows for some types of inter-procedural analysis of source code.
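The report does not show this transformation in detail; the following is a plausible sketch, assuming the statements reachable from the entry point are simply duplicated into a new subroutine (other strategies are possible):

SUBROUTINE s(a)
{block 1}
ENTRY e(b)
{block 2}
END

would become two autonomous subroutines:

SUBROUTINE s(a)
{block 1}
{block 2}
END

SUBROUTINE e(b)
{block 2}
END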
4.5 I/O statements

Expressions that are parameters of I/O statements are tripletized and loaded into registers. A simple example should be sufficient to clarify this. The statement:

WRITE(*,*) 123, a+1, func(a)

is changed into:

DATA c1 /1.0/
ris1 = 123
rfs1 = a
rfs2 = c1
rfs3 = rfs1+rfs2
CALL func(rfs4, a)
WRITE (*, *) ris1, rfs3, rfs4
4.6 Control Structures

In this section we describe the source-to-source transformations of basic control structures. The transformed Fortran program will contain only four control structures: do loops, while loops, if-then-else statements, and unconditional gotos. The other control structures are transformed into these using relatively simple strategies. For example, arithmetic ifs are transformed into if-then-else statements, and assigned gotos and computed gotos are transformed into a sequence of if statements.

It is possible to transform the if and do structures into more elementary operations. For example, a do loop may be transformed into a sequence of statements that initialize, increment, and test the value of the index variable (a sketch of this avoided expansion follows this overview). However, we maintain loops as high-level constructs. Accordingly, loop induction variables are not loaded into registers. Maintaining the original name of the variable and the loop construct allows us to perform dependence analysis, loop scheduling and other studies using high-level information. At any moment, though, machine code can be generated by straightforward transformations of these high-level control structures.

There are three kinds of IF statements in Fortran: arithmetic IFs, logical IFs and IF-THEN-ELSEs. In machine code, an IF statement is translated to an evaluation of the condition and a conditional branch with only two possibilities: continue sequentially or perform the branch. We translate IF statements into a form as close as possible to this one.
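For contrast, the elementary expansion that Ictíneo deliberately avoids would look roughly like the following sketch (labels and register names are illustrative; in a full translation, the limit would also be evaluated into a register):

DO i = lo, hi
{body}
ENDDO

expanded into elementary operations:

i = lo
10 CONTINUE
{body}
i = i+1
rls1 = i.LE.hi
IF (rls1) GOTO 10

Keeping the DO construct instead preserves the loop structure and the induction variable name for dependence analysis.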
4.6.1 IF-THEN-ELSE

The syntax of an if-then-else statement is:

IF (logical_expr) THEN
block1
[ELSE
block2]
ENDIF

If logical_expr is TRUE, block1 is executed; optionally, if it evaluates to FALSE, block2 is executed. Back-end compilers usually translate if-then-else statements into the evaluation of the condition followed by a conditional branch. The result of the condition can be stored in internal condition bits (processor status word) or in a register. We use a logical register to hold the condition result because it is more general (it is easy to adapt to a target machine with condition bits in a processor status word). For example, consider the following statements:

IF (a.EQ.b) THEN
{THEN_BLOCK}
ELSE
{ELSE_BLOCK}
ENDIF

If a and b are already stored in rfs1 and rfs2 respectively, Ictíneo generates:

rls1 = rfs1.EQ.rfs2
IF (rls1) THEN
{THEN_BLOCK}
ELSE
{ELSE_BLOCK}
ENDIF

In ANSI F77, a compound condition need not be entirely evaluated once its value is known. In our representation, the condition is entirely evaluated.

IF ((x.EQ.y) .OR. (y.EQ.z)) THEN
{THEN_BLOCK}
ELSE
{ELSE_BLOCK}
ENDIF

Assuming that x is stored in rfs1, y in rfs2 and z in rfs3, Ictíneo generates the following sequence of Fortran statements:

rls1 = rfs1.EQ.rfs2
rls2 = rfs2.EQ.rfs3
rls3 = rls1.OR.rls2
IF (rls3) THEN
{THEN_BLOCK}
ELSE
{ELSE_BLOCK}
ENDIF
4.6.2 Arithmetic IF

Arithmetic ifs have the following syntax:

IF (exp) l1, l2, l3

where exp is an integer or real arithmetic expression and l1, l2, l3 are valid labels. The result of this statement is that control is transferred to l1, l2, or l3 if the expression is less than, equal to, or greater than zero, respectively. Ictíneo first transforms the arithmetic expression and then inserts code that compares to zero and branches. For example, consider the statement:

IF (a(I)-1.0) 100, 200, 300

Assuming that a(I)-1.0 has been evaluated and stored in register rfs1, the conditional statement is translated into:

DATA rfs0 /0.0/
rls1 = rfs1.LT.rfs0
IF (rls1) THEN
GOTO 100
ELSE
rls2 = rfs1.EQ.rfs0
IF (rls2) THEN
GOTO 200
ELSE
GOTO 300
ENDIF
ENDIF

4.6.3 Logical IF

The Fortran logical IF has the following syntax:

IF (exp) S

where exp is a logical expression and S is one and only one Fortran statement. In our representation, one original Fortran statement might become more than one statement, so we must translate the logical IF to an IF-THEN statement. For example:

IF (a.NE.b) a = b

is translated into:

rfs1 = a
rfs2 = b
rls1 = rfs1.NE.rfs2
IF (rls1) THEN
rfs3 = b
a = rfs3
ENDIF
4.6.4 DO-ENDDO loops

DO-ENDDO loops have the following syntax:

DO index = num_expr1, num_expr2 [,num_expr3]
{loop_body}
ENDDO

In Fortran, the limits and the step of the loop are evaluated only once, before the first iteration. Ictíneo inserts statements that evaluate the limits and step before the loop body, and store the results in registers. For example, the following loop:

DO i=j*3, k+l, m
{body}
ENDDO

is transformed into (assume that j, k, l and m are stored in ris1, ris2, ris3 and ris4, respectively):

ris5 = ris1 * 3
ris6 = ris2 + ris3
DO i = ris5, ris6, ris4
{body}
ENDDO

The loop induction variable is accessed inside the loop body without loading it into a register. This allows us to maintain the original name for dependence analysis purposes.

4.6.5 DO-LABEL loops

DO-LABEL loops have the following syntax:
DO label index = num_expr1, num_expr2 [,num_expr3]
{loop_body}
label {statement}

We transform this kind of loop into a DO-ENDDO loop. Consider this loop:

DO 10 i=...
...
10 S

The label statement generally becomes more than one statement after being split itself. Once we have transformed the original DO-LABEL statement into a DO-ENDDO statement, we do not need the label to control the loop, but it cannot be removed because the label could serve other purposes, like being the target of an unconditional jump. The transformed code looks like:

DO i=...
...
10 S1
...
Sn
ENDDO

where S1, ..., Sn are the result of splitting statement S.

4.6.6 DO-WHILE loops

Although they are not defined in ANSI F77, most compilers accept DO-WHILE loops, so we accept them too for the sake of compatibility. DO-WHILE loops have the following syntax:

DO WHILE (exp)
{body}
ENDDO

We transform the expression exp before the loop body and insert a logical register in place of the original expression. In addition, it is necessary to recompute the logical value at the end of the loop body. For example, the loop:

DO WHILE (a .LT. b)
...
ENDDO

is translated into:

rfs1 = a
rfs2 = b
rls1 = rfs1.LT.rfs2
DO WHILE (rls1)
...
rfs3 = a
rfs4 = b
rls1 = rfs3.LT.rfs4
ENDDO

4.6.7 Computed GOTO

This statement has the following syntax:

GOTO (label1, label2,...), expr

where expr is a scalar integer expression. The computed GOTO statement passes control to one of several labeled statements depending on the value of expr. expr is evaluated and truncated to an integer value, the index. The index selects the statement label in the label list to which control is transferred. Ictíneo transforms it to a list of IF-THEN blocks. For example, the following statement:
GOTO (10,20,30),i

is changed to:

ris1 = i
rls1 = ris1.EQ.1
IF (rls1) THEN
GOTO 10
ENDIF
rls2 = ris1.EQ.2
IF (rls2) THEN
GOTO 20
ENDIF
rls3 = ris1.EQ.3
IF (rls3) THEN
GOTO 30
ENDIF

The expression is evaluated exactly once, before the first IF statement. The result of evaluating it, stored in ris1, is reused in each IF statement. It would be incorrect to evaluate it before every IF, since the expression might produce side effects.

4.6.8 Assigned GOTO

The syntax of ASSIGN statements and assigned GOTO statements is:

ASSIGN label TO scalar_int_var
GOTO scalar_int_var [[,] (label_list)]

The ASSIGN statement assigns a statement label to an integer variable. Once a variable is defined by an ASSIGN statement, it can be used in an assigned GOTO statement or as a format specifier in an I/O statement. The assigned GOTO statement transfers control to the statement whose label was most recently assigned to a variable with the ASSIGN statement. Consider the following statements:

ASSIGN 20 TO i
...
ASSIGN 30 TO i
...
GOTO i

In our internal representation, assigned GOTOs are transformed to computed GOTOs, and these are further transformed to a list of IF-THEN blocks as explained in Section 4.6.7. At the same time, we must transform the ASSIGN statements in such a way that the variable contains an index suitable for a computed GOTO, rather than a label suitable for an assigned GOTO. We do this by creating a new integer variable, which is assigned an index value in the range of the computed GOTO.
At this point, the ASSIGN statement is not needed anymore for this particular assigned GOTO, but we cannot delete it, because it can be used for I/O statements unrelated to our assigned GOTO. A new variable, iagoto, is created and used to control the branching. The code above becomes:

INTEGER*4 iagoto
ASSIGN 20 TO i
iagoto = 1
...
ASSIGN 30 TO i
iagoto = 2
...
ris1 = iagoto
rls1 = ris1.EQ.1
IF (rls1) THEN
GOTO 20
ENDIF
ris2 = iagoto
rls2 = ris2.EQ.2
IF (rls2) THEN
GOTO 30
ENDIF
5 The addresses notation

In the addresses representation we explicitly calculate addresses before accessing memory objects. In triplets, we had things like: "load the value of variable x into register rfs1". In addresses we have: "calculate the address of variable x; load the value contained in that memory position into register rfs1". We need mechanisms to calculate the address of program objects and to access memory using those addresses.

Ictíneo uses Fortran's LOC intrinsic function to obtain the address of objects. The LOC function returns the memory reference (location) of its argument. As with all the other intrinsics in Ictíneo, there is an interface subroutine (loc_ic) that eventually calls the LOC function. A memory reference is an INTEGER value, so Ictíneo employs integer virtual registers to store addresses.

We have implemented C subroutines to perform the memory accesses: lmemTS_ic performs loads and smemTS_ic performs stores, where T stands for the "type" of the program object (logical, integer, or float) and S stands for its "size" (single or double). They take the address of the object as a parameter and perform the load or the store. The name of the object being referenced is added as a redundant parameter to maintain high-level information. These routines use the base+displacement addressing mode for all memory accesses. The base and the displacement are kept in integer virtual registers.
In the following subsections we illustrate how we calculate the addresses of scalar and array values and we describe how we perform subroutine calls using the addresses of the actual parameters.
5.1 Scalars

For a load like rfs1=x in the triplets representation, the base is precisely the address of x, and the displacement is zero. In addresses code we have:

DATA ris0 /0/
CALL loc_ic(ris1, x)
CALL lmemfs_ic(rfs1, x, ris1, ris0)

Here, ris1 holds the base and ris0 contains the displacement. The call lmemfs_ic(rfs1, x, ris1, ris0) performs a load of the value in address ris1+ris0 into rfs1. By our definitions of loc_ic and ris0, the value ris1+ris0 corresponds to the address of x. Stores are similar; the statement:

x = rfs1

is translated to:

DATA ris0 /0/
CALL loc_ic(ris1, x)
CALL smemfs_ic(x, ris1, ris0, rfs1)

Here, smemfs_ic(x, ris1, ris0, rfs1) stores the value in rfs1 into the address designated by ris1+ris0. Notice that the memory access subroutines contain the name of the program variable as an extra parameter. Strictly speaking, all we would need is something like

rfs1 ← mem( ris1 + ris0 )

and yet, we have

rfs1 ← mem( x, ris1 + ris0 )

The extra parameter allows memory-analysis tools to keep track of what program variable is being referenced.
5.2 Arrays

For a one-dimensional array A of N elements with limits (l:l+N-1), the address of the element A(i) equals address_of(A) + (i-l)*size(elements of A). The term address_of(A) - l*element_size, independent of i, is called the "virtual origin" of the array and will be used as the base for the memory access. The term i*size(elements of A) will be used as the displacement.
The following example shows an array declaration and an access to element i:

REAL*4 A(10)
A(i) = 1.0

If we suppose that rfs1 contains the value 1.0, this is translated into:

DATA ris0 /0/
CALL loc_ic(ris1, A)
ris2 = ris1-4                            (ris2 ← V.O. of A)
CALL loc_ic(ris3, i)
CALL lmemis_ic(ris4, i, ris3, ris0)
ris5 = 4*ris4
CALL smemfs_ic(A(i), ris2, ris5, rfs1)
Here, -l*element_size is -1*4 and corresponds to the value -4. The V.O. is this value plus the address of the array, and it is stored in ris2.

We proceed similarly for arrays of more than one dimension. Suppose a matrix A of dimensions NxM, with limits (rl:rl+N-1, cl:cl+M-1), stored by columns. The address of element A(i,j) equals address_of(A) + (N*(j-cl) + (i-rl))*size(elements of A). If this expression is simplified to a normal additive form, we end up with terms that depend on i and j, and others that are independent. The value of the independent terms, address_of(A) + (-N*cl - rl)*size(elements of A), is the virtual origin of the matrix and is used as the base for the memory access. The term (N*j + i)*size(elements of A) is used as the displacement. The example that follows shows how the address calculation is performed.

REAL*4 A(10,*)
A(i,j) = 1.0

is transformed into (again, suppose that rfs1 contains 1.0):

DATA ris0 /0/
CALL loc_ic(ris1, A)
ris2 = (-44)+ris1                        (ris2 ← V.O. of A)
CALL loc_ic(ris3, i)
CALL lmemis_ic(ris4, i, ris3, ris0)
ris5 = 4*ris4
CALL loc_ic(ris6, j)
CALL lmemis_ic(ris7, j, ris6, ris0)
ris8 = 40*ris7
ris9 = ris5+ris8
CALL smemfs_ic(A(i, j), ris2, ris9, rfs1)
In this case, the V.O. is address_of(A)+(-10*1-1)*4, which corresponds to the value ris1-44.
5.3 Calling subroutines

In Fortran, parameters are passed by reference. When we have a subroutine call like this:

CALL subr( x(i) )
the code in the subroutine body "receives" the address of x(i). In the addresses representation we insert statements that calculate the addresses of values. In our example the following code would be inserted before the subroutine call:

DATA ris0 /0/
CALL loc_ic(ris1, x)
ris5 = ris1-4
CALL loc_ic(ris4, i)
CALL lmemis_ic(ris2, i, ris4, ris0)
ris3 = 4*ris2
ris6 = ris5+ris3
We end up with a virtual register (ris6) that holds the address of x(i). Now, if we perform the call

CALL subr( ris6 )

the subroutine body receives the address of a pointer to x(i), and this code will not execute correctly. Thus, we need an interface that allows us to pass addresses around and, at the same time, does not break subroutine calls. One solution is to force pass by value by means of Fortran's %VAL directive, like this:

CALL subr( %VAL( ris6 ) )

This is simple, direct and suits our purposes. However, we lose the high-level information of what program variable is being referenced. To solve this, we opted for calling a C function that performs the dereferencing for us and, at the same time, preserves high-level information. Having this C function gives us the chance to perform further analysis of subroutine calls without the need to add more code to the Fortran output. For example, it could be enhanced to study different ways of parameter passing, such as register windows. The syntax of this function is:

CALL alpha_ic(subr, 0, 1, ris6, x(i) )

The arguments of alpha_ic have the following meaning: subr is the name of the subroutine to call after we dereference values; 0 is the number of results that subr returns (it would be 1 for scalar functions and 2 for complex functions); 1 is the total number of parameters that will be passed to subr (number of results + number of arguments); next comes the list of return registers for functions, which is empty for subr (it would contain one register for a scalar function and two registers for a complex function); finally there are two lists: [addresses], [parameters]. In this particular example, each list contains only one element: ris6 contains the address of x(i) and is followed by x(i), the original parameter of the subroutine. The original program variable, x(i), is not required to complete the subroutine call, since alpha_ic dereferences the address of x(i) to access the value. However, it is included in the parameter list in order to preserve this high-level information.
5.4 Calling intrinsic functions

Recall from Section 4.4.3 that some intrinsic functions might be implemented as machine language instructions in a certain architecture and others might be implemented as software subroutines. The operands of intrinsics that are implemented as machine language instructions need to be (virtual) registers, so we do not need to calculate their address nor call the intrinsic via the alpha_ic interface. In the case of a call to y=sin(x), if sin were a machine instruction, we have:

CALL loc_ic(ris1, x)
CALL lmemfs_ic(rfs1, x, ris1, ris0)
CALL sinr_ic(rfs2, rfs1)
CALL loc_ic(ris2, y)
CALL smemfs_ic(y, ris2, ris0, rfs2)
We have loaded the value of x into register rfs1 and called sinr_ic with that value. Now, if sin were implemented as a subroutine call:

CALL loc_ic(ris1, x)
CALL alpha_ic(sinr_ic, 1, 2, rfs2, ris1, x)
CALL loc_ic(ris2, y)
CALL smemfs_ic(y, ris2, ris0, rfs2)
In this case we are passing the address of x to sinr_ic, so we need to dereference it via the call to alpha_ic.
6 Code Generation Strategies

The main reason for initiating the Ictíneo project was to support ongoing research at the Universitat Politècnica de Catalunya on the automatic detection of fine-grain parallelism and on the analysis and evaluation of architectural features to exploit this type of parallelism. We were faced with two options. One was to upgrade an existing compiler, such as the GNU C compiler [GNU], with powerful analysis and transformation routines. The second option was to extend a source-to-source translator in the form described in the preceding sections. We chose the second option because it seemed easier and more general.

This approach is easier because extending one of the many source-to-source translators available today involves only the addition of some translation steps, such as the transformation of assignment statements into triplets, the generation of the instructions needed to compute addresses, etc. In fact, we have implemented the translation to low-level form in both Polaris [BEKG94] and Parafrase-2 [PGHL89] at the cost of only six man-months.

It is also more general because most of the optimizations needed to generate efficient code, such as common subexpression elimination, are already available in many source-to-source translators. Some additional optimizations, such as register allocation, have to be
implemented because they are usually not available in source-to-source restructurers, but they are only a fraction of the total number of optimizations needed, and in any case they would have to be done for each new architecture because these optimizations are usually part of a complex heuristic to generate code for fine-grain parallelism.

In this section we present a brief outline of the steps we believe are necessary to generate compilers and other tools that are useful for the research at UPC. The idea is to start with the source-to-source restructurer and represent Fortran programs in the way described above. There are at least two classes of tools that would be useful to study fine-grain parallelism. One would be used to study machine organization using simulators. For this class of tools, it is not necessary to generate machine code because the simulators can operate directly on the low-level Fortran code. The second class of tools would generate machine code and can be used to study the effect of transformations on real machines whose accurate simulation would be too expensive and therefore unfeasible.

For both classes of tools it is necessary to analyze and transform the code in the best possible way. Some of these analyses and transformations may be applied regardless of the target machine and will have to be implemented unless they are already part of the source-to-source translator:

1. Constant/value propagation
2. Common subexpression elimination
3. Strength reduction/induction variable recognition (see the sketch after this list)
4. Dead code elimination
5. Invariant removal
6. Privatization
7. Recurrence (including reduction) recognition
8. Dependence analysis
9. Interprocedural analysis (interprocedural versions of the above analysis and transformation passes). Inlining. Cloning.
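As an illustration of item 3, strength reduction is easier to see on the low-level representation because the address arithmetic is explicit there. The following is a minimal sketch (the register numbering is illustrative, and the transformation gives up the single-assignment property for ris5). Before the transformation, the displacement of A(i) is recomputed with a multiply in every iteration:

DO i = 1, n
ris5 = 4*i
...
ENDDO

After strength reduction, the multiply becomes an add, and the loop index may later turn out to be a redundant induction variable:

ris5 = 0
DO i = 1, n
ris5 = ris5+4
...
ENDDO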
Some of these passes are more effective on the low-level representation (e.g. strength reduction), while others can be applied equally well to the low-level and the high-level representation of the program. Deciding at what level to apply a transformation or analysis phase, and its effect on the performance and accuracy of the compiler, is still an open problem.

After the machine-independent transformations have been applied, we can proceed to apply the machine-dependent transformations. These include register allocation and instruction scheduling. Usually, these transformations interact with each other and are based on heuristics that use information on dependences, private variables, induction variables, reductions, etc. A number of heuristics have been developed for machine-dependent transformations. However, there is still much room for evaluation and perhaps for improvement. Also, new architectural features will present new challenges.
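To illustrate the register allocation step on the low-level form: since virtual registers are single-assignment Fortran scalars, mapping them to a finite register set amounts to renaming them so that names (and hence storage) are reused. A minimal sketch, assuming only two floating-point registers are available (the names are illustrative). Before allocation, six virtual registers are used:

rfs1 = a
rfs2 = b
rfs3 = rfs1+rfs2
rfs4 = c
rfs5 = rfs3*rfs4
y = rfs5

After mapping to two physical registers (single assignment is given up):

rfs1 = a
rfs2 = b
rfs1 = rfs1+rfs2
rfs2 = c
rfs1 = rfs1*rfs2
y = rfs1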
As mentioned above, the output of the machine-dependent phase can be used in a number of research studies. However, for those projects requiring executable code, it is necessary to develop a last phase to generate assembly or object code. This phase had not been developed at the time this report was written. However, it seems clear that it should be straightforward and require only one pass through the low-level representation. In fact, it seems that all the steps can be represented as simple rules that only require information from each statement separately.
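To suggest how local those rules can be, the following is a sketch of per-statement mappings for a hypothetical MIPS-like target (the mnemonics and register assignments are illustrative, not a supported back-end):

rfs3 = rfs1+rfs2                        →   add.s  $f3, $f1, $f2
CALL lmemfs_ic(rfs1, x, ris1, ris0)     →   lwc1   $f1, 0($t1)      (base+displacement load)
CALL smemfs_ic(y, ris2, ris0, rfs2)     →   swc1   $f2, 0($t2)

Each rule inspects only the statement itself, which is why a single pass should suffice.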
7 Research Projects

As mentioned in the previous section, a number of research projects can be supported with a source-to-source restructurer extended as described in this report. In this section we describe some research that has benefited from our source-to-source restructurer. In all cases, the underlying consideration is that a compiler is necessary to experiment with real applications in the evaluation of proposed architecture designs and compiler algorithms.
7.1 Architectural Features of Superscalar Machines

There are many issues involved in the design of superscalar machines, including the number and type of functional units and the organization of the register files. For instance, [LVAL94] studies the register requirements of software pipelined loops for different superscalar and superpipelined configurations. The impact on performance of having a finite number of registers in the register file is also evaluated in terms of the increase in memory traffic and slowdown.

Two different register organizations are proposed in [LVFA94] and [LlVA95]. In [LVFA94], a new organization consisting of a combination of a small high-bandwidth multiported register file and a low-bandwidth port-limited register file (called a sack) is presented. The sack has a single read/write port, and therefore it is cheaper (in area) and faster than the multiported file, so it can contain a large number of physical registers. An algorithm to assign values to the sack is proposed. In [LlVA95] a non-consistent dual register file is presented as a new register organization to reduce register pressure. This organization is inspired by the implementation of the register files of some new processors such as the Power2 [WhDh93]: the register file is implemented as two register subfiles with the same number of registers and the same number of write ports, but half the number of read ports into each register subfile. The two subfiles are consistent in the sense that both store exactly the same value in the same registers. This implementation reduces the complexity of the register file. In [LlVA95] the authors propose and evaluate an organization where each subfile can be accessed independently of the other and can store different values; this gives the freedom of storing some values in the two subfiles in a consistent way or storing some values in just one of the two subfiles. The authors show that the proposed organization is cheaper than doubling the number of registers, does not penalize the access time to the register file, and in most cases is as effective as doubling the number of registers.
Topics of future research include the evaluation of other register file organizations with several sacks and algorithms to assign values to them. The optimization of spill code is another area that needs further research effort. Having the compiler whose development is proposed in this report would allow us to evaluate the architectural features, and the algorithms that exploit them, on real programs (the Perfect Club benchmarks [Poin89], SPEC [Dixi91], and proprietary applications). Otherwise, the evaluation is restricted to small benchmarks and loops such as the Livermore Loops [McMa72].
7.2 Increasing the memory bandwidth by widening buses

• Requirements of increasing the number of buses. Microprocessor performance is increasing at higher rates than the performance of the memory subsystem. Instruction level parallelism (ILP) techniques increase the efficiency of the processor core, requiring low-latency and high-bandwidth memory subsystems. Low-latency memory accesses have traditionally been achieved by using cache memories, which have also contributed to increasing the bandwidth of the memory subsystem. However, heavily exploiting ILP places large memory bandwidth demands, particularly in numeric applications. To meet the high bandwidth requirements, some current microprocessors have been built with two memory buses. Doubling the number of buses has several costs and drawbacks at different levels of the memory hierarchy: 1) processor core: it requires doubling the number of load/store units and adding register-file ports, so the access time increases and there is a noticeable area penalty; 2) on-chip cache: to allow two memory accesses per cycle a dual-ported cache is required, which can be implemented either by duplicating the data cache and maintaining identical copies of data in each cache (doubling the die area of the primary data cache), or by splitting the data cache into two (or more) independent banks (bank conflicts can reduce the effective bandwidth); 3) Translation Lookaside Buffer: multiporting the TLB can increase cycle time as well as require some extra die area; 4) off-chip connection: performing two off-chip memory accesses per cycle requires two buses with address, control and data information; having two off-chip buses increases the number of pins required (complicating the chip package and making it more expensive) and increases the complexity of the off-chip memory system (with problems similar to those of on-chip caches).

• Widening buses. An approach to obtaining high bandwidth at reduced cost is to increase the width of the memory buses and move several (consecutive) words per memory access. The advantage is that to perform two loads/stores per cycle we only require one issue slot, one address generator, one address translation per cycle, and one "wide" memory port (i.e. one address bus, one control bus and two data buses). Nevertheless, this technique also has some drawbacks: wide loads/stores must be compacted statically by the compiler, losing some potential performance gains in programs with irregular access patterns.

• Compiling for wide buses. To compact wide loads/stores statically, stride information is required. Consider the loop of Figure 2a and its dependence graph (Figure 2b). Considering an architecture with one
bus, two adders and two multipliers, the loop is memory-bound (three memory operations for every two arithmetic operations). Using Ictíneo we extend the dependence graph of the innermost loops with stride information (the dotted arrows in Figure 2b). Unrolling the loop once (Figure 2c) then allows the compiler to compact memory operations: after unrolling, there are three memory operations for every four arithmetic operations, and one wide bus achieves the same performance as two single buses.

DO I=1,N
C(I)=A(I)*B(I)+D
ENDDO
[Figure 2: (a) Example loop. (b) Its dependence graph with stride information extracted with Ictíneo. (c) The compacted graph.]
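At the source level, the unrolling of Figure 2c corresponds to the sketch below (assuming N is even; otherwise a short cleanup loop, not shown, handles the remainder). After unrolling, A(I) and A(I+1) are consecutive in memory, as are the B and C pairs, so each pair can be moved with one wide memory access:

DO I = 1, N, 2
C(I) = A(I)*B(I)+D
C(I+1) = A(I+1)*B(I+1)+D
ENDDO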
7.3 Low-level Instruction Scheduling

Low-level parallelism and instruction scheduling for superscalar and VLIW architectures is another area of research that would benefit from this compiler. For instance, the evaluation of program characteristics such as inherent parallelism [BoLB93] and the implementation of scheduling algorithms [BaLB94, BaLA94] that try to attain the available parallelism and improve memory locality could be carried out in more detail. The impact of architectural restrictions (such as the number of functional units, the operations performed by them, and the number of registers and ports to memory) on the parallelism obtained by the scheduling algorithms on real applications is an example of research work that is planned for the near future. A number of studies on compiler strategies that take into account both high- and low-level information are possible, including the study of trade-offs between high- and low-level parallelism and the design of strategies to enhance locality not only at the main memory and cache levels, but also at the register level.
8 References

[ABCC88] F. Allen, M. Burke, P. Charles, R. Cytron and J. Ferrante. "An Overview of the PTRAN Analysis System for Multiprocessing". Journal of Parallel and Distributed Computing, Vol. 5, 1988.

[BEKG94] B. Blume, R. Eigenmann, K. Faigin, J. Grout, J. Hoeflinger, D. Padua, P. Petersen, B. Pottenger, L. Rauchwerger, P. Tu and S. Weatherford. "Polaris: The Next Generation in Parallelizing Compilers". Proceedings of the Seventh Workshop on Languages and Compilers for Parallel Computing, 1994.

[ABLL95] E. Ayguadé, C. Barrado, J. Labarta, D. López, S. Moreno, D. Padua and M. Valero. "A Uniform Internal Representation for High-Level and Instruction-Level Transformations". Report UPC-CEPBA-95-01, 1995.

[BENP93] U. Banerjee, R. Eigenmann, A. Nicolau and D. Padua. "Automatic Program Parallelization". Proceedings of the IEEE, 81(2), February 1993.

[BaLA94] C. Barrado, J. Labarta and E. Ayguadé. "An Efficient Scheduling for Doacross Loops". Proceedings of the ISMM Parallel and Distributed Computing and Systems, 1994.

[BaLB94] C. Barrado, J. Labarta and P. Borenzstejn. "Implementation of GTS". Proceedings of the Int. Conf. on Parallel Architectures and Languages Europe, 1994.

[BoLB93] P. Borensztejn, J. Labarta and C. Barrado. "Measures of Parallelism at Compile Time". Proceedings of the 1st EUROMICRO Workshop on Parallel and Distributed Processing, 1993.

[Dixi91] K. Dixit. "The SPEC Benchmarks". Parallel Computing, No. 17, 1991.

[LlVA95] J. Llosa, M. Valero and E. Ayguadé. "Non-consistent Dual Register Files to Reduce Register Pressure". Proceedings of the 1st Int. Symposium on High Performance Computer Architecture, 1995.

[LVAL94] J. Llosa, M. Valero, E. Ayguadé and J. Labarta. "Register Requirements of Pipelined Loops and its Effects on Performance". Proceedings of the 2nd Int. Workshop on Massive Parallelism, 1994.

[LVFA94] J. Llosa, M. Valero, J. Fortes and E. Ayguadé. "Using Sacks to Organize Registers in VLIW Machines". Proceedings of the CONPAR94-VAPP-VI Conference, 1994.

[McMa72] F. McMahon. "Fortran CPU Performance Analysis". Lawrence Livermore Laboratories, 1972.

[PGHL89] C. Polychronopoulos, M. Girkar, M. Haghighat, C. Lee and B. Leung. "Parafrase-2: An Environment for Parallelizing, Partitioning, Synchronizing, and Scheduling Programs on Multiprocessors". Proceedings of the Int. Conf. on Parallel Processing, Vol. II, 1989.

[Poin89] L. Pointer. "Perfect Report: 1". CSRD Report No. 896, University of Illinois, 1989.

[ScKo86] R. G. Scarborough and H. G. Kolsky. "A Vectorizing Fortran Compiler". IBM Journal of Research and Development, Vol. 30, No. 2, pp. 163-171, March 1986.

[TWLP91] S. Tjiang, M. Wolf, M. Lam, K. Pieper and J. Hennessy. "Integrating Scalar Optimization and Parallelization". 4th Workshop on Languages and Compilers for Parallel Computing, 1991.

[WhDh93] S. White and S. Dhawan. "POWER2: Next Generation of the RISC System/6000 Family". IBM RISC System/6000 Technology: Volume II, IBM Corporation, 1993.

[Wolf82] M. Wolfe. "Optimizing Supercompilers for Supercomputers". PhD Thesis, University of Illinois, Department of Computer Science, 1982.

[Zima91] H. P. Zima. "Supercompilers for Parallel and Vector Computers". ACM Press, New York, NY, 1991.
Appendix A. LibIctíneo

Memory access functions:
alpha_ic, lmemfd_ic, lmemfs_ic, lmemid_ic, lmemis_ic, lmemls_ic, smemfd_ic, smemfs_ic, smemid_ic, smemis_ic, smemls_ic

System library functions:
access_ic, alarm_ic, and_ic, bit_ic, chdir_ic, chmod_ic, ctime_ic, drand_ic, dtime_ic, etime_ic, fdate_ic, fgetc_ic, fork_ic, fputc_ic, fseek_ic, ftell_ic, getc_ic, getcwd_ic, getfd_ic, getfilep_ic, getpid_ic, getuid_ic, hostnm_ic, iargc_ic, ieee_flags_ic, inmax_ic, ioinit_ic, irand_ic, isatty_ic, kill_ic, link_ic, lnblnk_ic, long_ic, lshift_ic, lstat_ic, not_ic, or_ic, putc_ic, rand_ic, rindex_ic, rshift_ic, short_ic, stat_ic, symlnk_ic, system_ic, unlink_ic, wait_ic, xor_ic

Intrinsic functions:
absd_ic, absi_ic, absr_ic, acosd_ic, acosr_ic, aintd_ic, aintr_ic, alog10r_ic, alogr_ic, anintd_ic, anintr_ic, asind_ic, asinr_ic, atan2d_ic, atan2r_ic, atand_ic, atanr_ic, chari_ic, cosc_ic, cosd_ic, coshd_ic, coshr_ic, cosr_ic, dabsd_ic, dacosd_ic, dasind_ic, datan2d_ic, datand_ic, dbled_ic, dblei_ic, dbler_ic, dcosd_ic, dexpd_ic, dfloati_ic, dfloatr_ic, dimd_ic, dimi_ic, dimr_ic, dlog10d_ic, dlogd_ic, dmodd_ic, dsignd_ic, dsind_ic, dsqrtd_ic, dtand_ic, expc_ic, expd_ic, expr_ic, floatd_ic, floati_ic, iabsi_ic, iandi_ic, ichar_ic, idintd_ic, ifixr_ic, index_ic, intd_ic, inti_ic, intr_ic, len_ic, lge_ic, lgt_ic, lle_ic, llt_ic, loc_ic, log10d_ic, log10r_ic, logc_ic, logd_ic, logr_ic, maxd_ic, maxi_ic, maxr_ic, mind_ic, mini_ic, minr_ic, modd_ic, modi_ic, modr_ic, nintd_ic, nintr_ic, powdd_ic, powdi_ic, powii_ic, powri_ic, powrr_ic, reald_ic, reali_ic, signd_ic, signi_ic, signr_ic, sinc_ic, sind_ic, sinhd_ic, sinhr_ic, sinr_ic, sngld_ic, snglr_ic, sqrtc_ic, sqrtd_ic, sqrtr_ic, tand_ic, tanhd_ic, tanhr_ic, tanr_ic