GNU Instruction Scheduler: Ailments and Cures in Context of Superscalarity

Andreas Unger and Eberhard Zehendner
Computer Science Department, Friedrich-Schiller-University, D-07740 Jena, Germany
Abstract: In the past, the GNU C compiler (GCC) has been successfully ported to several superscalar microprocessors. For that purpose, the instruction timing of the target processor has usually been modeled in a straightforward manner. Unfortunately, in our experience, this is likely to lead the instruction scheduler astray. In this paper we describe some of our experiments that revealed such flaws, concerning the DEC Alpha 21064 as well as other superscalar RISC processors. We analyze the circumstances that led to poorly scheduled code, and demonstrate how the machine description supplied for a superscalar processor can be modified to fix some of these problems without hampering the portability of the GCC. On the other hand, we show situations for which we do not have a solution within the given framework.
I. Introduction

The GNU C Compiler (GCC) [12] has been designed to combine high portability with the generation of fast executing programs. To render the GCC portable, all machine-dependent information is separated from the machine-independent code and placed in the machine description. The specification of the machine description has to be provided in a prescribed format, which keeps the main code of the compiler truly machine-independent. This general specification format determines which features of a processor can be described.

To obtain efficient target code, several optimizations are performed before the final code generation. Here, we focus on the instruction scheduling passes for pipelined processors. To achieve maximum performance, the pipeline needs to be fed with an instruction stream that keeps all processing units busy. Data dependencies, control dependencies, or sharing of a resource may prevent the instructions in a sequential instruction stream from executing in the shortest possible time succession. In this situation, the instruction scheduler attempts to rearrange the instruction stream favorably, while preserving the semantics of the given program. Thereby, unused time slots between conflicting instructions are filled with non-conflicting instructions from other positions in the program. Effective instruction scheduling is crucial for achieving high throughput on pipelined processors, and deserves special interest in the context of superscalar processors.

This paper documents our analysis of the instruction scheduler used in version 2.6.0 of the GCC. We studied the effects of the scheduler on four common
superscalar RISC processors, namely the DEC MIPS R3000 [5], SUN SuperSPARC [13], IBM RS/6000 [8], and DEC Alpha 21064 [4]; the emphasis has been on the Alpha processor. (Of course, we expect the spirit of our findings also to apply to further versions of the compiler as well as to other superscalar processors.) During our investigation we were particularly interested in the following questions: Is the machine description format of the GCC suitable for superscalar processors? Should we be satisfied with the amount of instruction-level parallelism that the GNU instruction scheduler can extract from a single basic block? How can we tune the GNU instruction scheduler to produce more efficient code for a specific superscalar processor, without changing the machine-independent portions of the provided compiler?

In this paper we neither aimed to extend the scope of the instruction scheduler beyond basic blocks [1], [3], [7], [10], nor to compare different scheduling algorithms [3], different compilers for the same machine, or the performance of a compiler on different machines.

The structure of the paper is as follows: In section 2 we introduce the GNU C compiler and its optimizer. Section 3 describes our experiments to evaluate the overall performance of the GNU instruction scheduler. In section 4 we discuss some sample programs that helped to reveal weaknesses in the interplay between the instruction scheduler and the machine description; section 5 presents a completely artificial test program to demonstrate further consequences of this interplay. In section 6 we suggest some simple modifications to the machine description that significantly improve the performance of the instruction scheduler. Section 7 gives a summary of the problems encountered in our analysis of the GNU instruction scheduler, and emphasizes our point of view concerning changes in the GCC.

II. Compiler and optimizer
The GCC has been designed to guarantee high portability [12]. Therefore, all machine-dependent information was excluded from the general source code and placed into the machine description, which is inserted into the general code during installation of the compiler. In this approach, the properties of any processor the compiler is ported to must be coded in terms of the given specification.
So the compiler, and especially all optimizers, can only use the information about a processor that is expressible by the constructs of the machine description.

The basic abstraction of the machine description related to the scheduler consists in dividing the processor into function units, describing these units, and listing for each function unit the instructions that use it. A special expression, define_function_unit, is used to introduce a function unit. This expression includes fields for the name of the function unit, the number of identical units in the processor, the maximum number of instructions that can be executed simultaneously in each instance of the function unit, a selection of instructions (or classes of instructions) that use the function unit, the throughput, and the latency. Via an optional list, the detailed cost of using the function unit can be specified.

These constructs do not allow a direct description of superscalar implementations, since there are no means to express concurrency. Therefore, all costs are multiplied by the number of instructions that can be executed in parallel. This transformation preserves the proportion of costs between pairs of instructions while signalling the optimizer that there are unfilled slots in the instruction stream. So we can state execution costs that force the optimizer to produce an instruction stream which feeds all processing units with the same number of instructions. If there are further restrictions on the parallel execution of operations, the processing costs should be specified separately via the optional list mentioned above. However, some restrictions cannot be expressed at all by the given means, e.g., memory alignment.
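As an illustration of how such a unit is declared (the entry below is invented for this paper and is not taken from any shipped machine description; the unit name "memory" and the attribute value "load" are assumptions), a define_function_unit expression for a dual-issue target with all costs scaled by two might look as follows:

  ;; Illustrative only: one load/store unit on a dual-issue processor.
  ;; All costs are stated in issue slots rather than machine cycles, so a
  ;; load whose result is available after three machine cycles gets a
  ;; ready delay of 6, and a new memory operation can be issued to the
  ;; unit every machine cycle (issue delay 2).
  (define_function_unit "memory" 1 0
    (eq_attr "type" "load")
    6 2)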
The optimizer

On pipelined processors, the execution order affects the execution speed. The length and the organization of the pipeline are the characteristics of a processor that call most urgently for instruction scheduling. For superscalar implementations, scheduling transformations become even more important, since the amount of usable instruction-level parallelism required to completely utilize all processing units increases. The number of available registers sets a limit on the degree of parallelism that can be exploited.

For memory accesses and branch operations, a latency of one cycle cannot be achieved in general. For memory accesses, the organization of the memory hierarchy and the characteristics of the data paths influence the number of cycles needed to transfer the requested data between memory and processor. To reduce the cost introduced by the execution of branch operations, many different techniques are employed, such as delayed branches, branch prediction, branch lookahead, etc.

Instruction scheduling attempts to find an order of the instructions in a program which allows fast execution while preserving the program semantics. The problem of finding an optimal solution can be shown to be NP-complete [9]. Therefore, a greedy algorithm called list scheduling is used in most established compilers to find an approximate solution.
In each step of the algorithm, only the instructions that could fill the next slot are taken into account, regardless of the overall performance of the final solution. In the case of the instruction scheduling problem, available global information can be used to select locally the best alternative. This information represents the critical path of a single basic block, which connects an independent node of the data flow graph (DFG) with a leaf node, and which has the longest execution time among all paths. The algorithm tries to schedule the instructions lying on the critical path to execute as fast as possible, and to fill the unusable slots within this path with instructions from other paths of the same basic block.

Before running the actual list scheduling, a data flow analysis is performed to locate the critical path, and to provide for preservation of the program semantics when rearranging the operations. Worst-case assumptions are made for memory accesses and procedure calls. Due to the kind of data flow analysis, the code section the optimizer is running on is limited to one basic block. After the construction of the DFG, each node is given a weight which initially represents the critical path and later is also used to control the register usage. Then the list scheduling algorithm walks through the DFG in the reverse direction.

In addition to a doubly-linked list, which holds the intermediate representation of the input program, the following data structures are used: a ready-list, containing all instructions that are ready, i.e., those that do not have outgoing edges in the DFG, and a queue, containing all blocked instructions, i.e., those that do not have outgoing edges but cannot be executed due to processor resources already used by another instruction. The scheduling algorithm [14] proceeds as follows; a simplified sketch in C is given after the list:
0. The list and the queue are initialized to be empty. Then all instructions that are ready are moved to the ready-list.
1. All entries of the queue are inspected to determine whether they can be moved to the ready-list.
2. If the ready-list is now empty, the corresponding processor cycle cannot be filled with an instruction. If the compiler has to take care of pipeline conflicts, No-Ops are inserted into the program code.
3. The ready-list is sorted by decreasing weights of the entries.
4. The first entry of the ready-list is removed and put at the last open position in the list that contains the intermediate code of the program.
5. All instructions joined by an edge to the entry that has been processed in the previous step are tested to determine whether they have to be moved to the ready-list or to the queue. All instructions in the ready-list are examined to determine whether they have to go to the queue.
6. End of the loop: if all instructions in the basic block have been processed, the algorithm terminates; otherwise continue with step 1.
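To make the interplay of the ready-list, the queue, and the weights concrete, the following sketch schedules a tiny hand-made block. It is not taken from the GCC sources; the data structures, the example block, and the forward scheduling direction are simplifications assumed for illustration.

  /* A much simplified sketch of the list-scheduling loop described above.
     It is not the GCC implementation (the GNU scheduler works on RTL,
     consults the function units of the machine description, and walks the
     block backwards); it only illustrates how the ready-list, the queue of
     blocked instructions, and the weights interact.  */

  #include <stdio.h>

  #define N 5                        /* instructions in the basic block   */

  struct insn {
    const char *name;
    int weight;                      /* critical-path weight from the DFG */
    int latency;                     /* cycles until the result is ready  */
    int npred;                       /* dependences not yet satisfied     */
    int nsucc, succ[N];              /* dependent instructions            */
    int ready_cycle;                 /* earliest cycle at which the insn
                                        may leave the queue               */
  };

  int main (void)
  {
    /* i2 and i3 depend on the load i0; the store i4 depends on i2 and i3;
       the add i1 is independent and can fill a load-delay slot.  */
    struct insn b[N] = {
      { "ldq  r1", 5, 3, 0, 2, {2, 3}, 0 },
      { "addq r2", 1, 1, 0, 0, {0},    0 },
      { "mulq r3", 4, 2, 1, 1, {4},    0 },
      { "subq r4", 3, 1, 1, 1, {4},    0 },
      { "stq  r5", 1, 1, 2, 0, {0},    0 },
    };
    int done = 0, cycle = 0;

    while (done < N)
      {
        int i, best = -1;

        /* Steps 1-4: among the instructions whose dependences are
           satisfied, pick a ready one (ready_cycle <= cycle) with the
           largest weight; the rest form the queue of blocked insns.  */
        for (i = 0; i < N; i++)
          if (b[i].weight >= 0 && b[i].npred == 0
              && b[i].ready_cycle <= cycle
              && (best < 0 || b[i].weight > b[best].weight))
            best = i;

        if (best < 0)
          {                          /* step 2: the slot stays empty (a
                                        No-Op on targets that need one)   */
            cycle++;
            continue;
          }

        printf ("cycle %2d: %s\n", cycle, b[best].name);

        /* Step 5: release the successors of the scheduled instruction.  */
        for (i = 0; i < b[best].nsucc; i++)
          {
            struct insn *s = &b[b[best].succ[i]];
            if (cycle + b[best].latency > s->ready_cycle)
              s->ready_cycle = cycle + b[best].latency;
            s->npred--;
          }

        b[best].weight = -1;         /* mark as scheduled                 */
        done++;
        cycle++;
      }
    return 0;
  }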
III. Overall performance of the GNU instruction scheduler
We focus on obtaining information about the performance of the GNU instruction scheduler, in particular on its ability to expose a sufficient amount of instruction-level parallelism to superscalar processors, and on the problems arising from its restriction to basic blocks. Widely accessible test suites were used to measure the overall effect of instruction scheduling with respect to the four processors mentioned in section I. We chose one floating-point source program (Livermore Kernels) and one fixed-point source program (Dhrystone), because the differences between the two kinds of programs are large in general. These programs are usually used as benchmarks for measuring the typical performance of a new machine; in contrast, in this paper we are definitely not interested in any kind of processor benchmarking and therefore do not present any information about machine-related performance.

Both source programs were compiled with the GCC (version 2.6.0) at optimization level O2, with and without enabling instruction scheduling. (The Livermore Kernels first had to be translated from FORTRAN to C using f2c.) Each run of one of the compiled programs yielded several figures that characterize the performance of the system the test was applied to; in the context of our investigation, this system consisted of the GCC and one of the target machines. Output values from a hundred sample runs per program were averaged to filter out random effects that could be caused by other processes running on our test machines at the same time. We thus suppose that the differences between the averages for the optimized and the non-optimized programs exclusively reflect the effects of the instruction scheduler on the performance of the sample programs.

Table I shows the average performance improvements achieved by using the GNU instruction scheduler. Since fixed-point operations usually have smaller latencies than floating-point operations, the improvement that can be achieved for Dhrystone by the instruction scheduler is smaller than for the Livermore Kernels; using instruction scheduling even causes a slight slowdown on the MIPS. We also tried to measure the impact of the loop-unrolling transformation of the GCC on instruction scheduling, without significant results.

The Livermore Kernels are composed of several loop kernels extracted from various numerical programs. The benchmark outputs an overall result, as well as some performance values for each loop kernel. These are not suitable for the characterization of a system, since they are based on small code fragments. We abuse these values to find code portions where the instruction scheduler performs poorly. From these portions we distill tiny programs (discussed in section IV) that show the same poor behavior.

We summarize the results of our investigation as follows: For the SuperSPARC, the transformation performed by the scheduler has a positive result on all loop kernels; for 16 kernels out of 24 it is greater than 15%. We observed the largest variation on the Alpha 21064 processor; the effect of scheduling ranges from a slowdown of about 26%
IV. Sample programs

  void main()
  {
    static long int i;
    int x[2];
    for (i = 2; i
  }

The instruction scheduler seems to output a code sequence that results in this stall; at least no other documented property of the processor can explain why the extracted program should be slowed down by about 35%. It is even harder to avoid an occurrence of this stall. First, not only memory accesses to the same address can cause a stall, but operations on neighboring cells, too. So it is more difficult to decide whether two memory operations will cause a stall. On the other hand, more than two instructions are involved in a situation causing a pipeline stall. Such constellations were not taken into account in the design of the instruction scheduler.

V. An artificial test program

Finally, we present a program not based on the Livermore Kernels. We constructed this program to show that the calculation of the weights for the DFG nodes may mislead the algorithm. This calculation [14] is based on the data dependencies, but does not take into account the conflicts due to the usage of the same function unit. Therefore, two instructions were chosen which share a function unit but cannot be executed directly in sequence without a pipeline stall. They are placed in the source program in such a way that they get the same weight. By inserting additional instructions into the program, the cycles between these two instructions are filled.

Figure 4 shows our source program for the Alpha processor. The two conflicting instructions are multiplications. They are assigned a weight of one, since these are the first instructions on a path through the DFG. The first assignment to variable c is necessary to have some operations that do not depend on either multiplication. The final assignment prevents the compiler from deleting all instructions.

  long int a2, b2, c;

  void main()
  {
    int i;
    long int a, b;
    for (i = 2; i
  }

Fig. 4. Source program for the Alpha processor (fragment)
The scheduler processes the instructions within a block by descending weights. When the algorithm reaches the end of the block, only the multiplications are left. The pipeline bubbles caused by the stall between these multiplications cannot be filled, so the scheduler has been trapped in a local minimum.

The problem is due to the fact that the global information fed into the greedy algorithm does not reflect such dependencies. To expose the slowdown, two imul instructions were chosen (which use the multiplier for 22 cycles), together with 22 add instructions to fill the gap. The unscheduled program needs 5.8 seconds to run, the scheduled program needs 7.6 seconds. This program can be modified to demonstrate the same effect on all other machines.

The problem could be solved by inserting additional edges into the DFG, representing such dependencies. This should be done only for instructions which block a function unit for a long time. Furthermore, these edges should be inserted before calculating the weights and removed afterwards, because otherwise they would inhibit the possibilities of rearranging.
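As a hedged illustration of the kind of program described above (this is not the authors' Figure 4; the variable roles, the loop bound, and the number of filler additions are assumptions), such a weight trap can be provoked by code of the following shape:

  /* Illustrative sketch only -- not the original test program.  The two
     multiplications are independent of each other and of the addition
     chain, both use the integer multiplier, and neither result is needed
     again inside the block, so they receive equally small weights.  A
     scheduler that ignores the shared function unit may then leave both
     multiplications for the end of the block, where the multiplier stall
     can no longer be hidden by the additions.  */

  long int a, b, a2, b2, c;          /* globals, so their values and the
                                        stores to a2 and b2 stay live     */

  int main (void)
  {
    int i;
    long int t;

    for (i = 2; i < 50000000; i++)
      {
        t = a + b;                   /* additions that depend on neither  */
        t = t + i;                   /* multiplication; the original      */
        t = t + a;                   /* program uses about 22 of them     */
        t = t + b;
        a2 = a * i;                  /* the two conflicting multiplications */
        b2 = b * i;                  /* share the integer multiplier        */
        c = t;                       /* final assignment keeps the addition */
      }                              /* chain from being optimized away     */
    return 0;
  }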
VI. Modifications to the machine description

The code sequence presented in section IV stimulated the instruction scheduler to slow down the program. In this section we demonstrate modifications to the machine description that prevent the scheduler from producing inefficient code under the same conditions. In a target program compiled from the same source code using DEC's OSF CC (version 4.3), the compare instruction and the branch instruction took the last two positions in the basic block. Thus we suggest either declaring all integer store instructions as expensive for a following branch as the compare instruction, or binding the compare instruction to the branch instruction. The second variant seems too restrictive, but taking into account all dual-issue rules, one can deduce a probability of approximately 10% of losing a single cycle on a pair of compare and branch instructions. Unfortunately, the given machine description does not make it easy to change the cost for integer stores alone. We therefore tested two alternative modifications. First, the cost for any store instruction and a branch were increased by introducing a new, fictitious function unit, which is used by the store and the branch, and needs two cycles to complete a store. See figure 5.
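A rough, hedged sketch of what such a fictitious function unit could look like in the machine description is shown below; this is not the content of figure 5, and the unit name and attribute values are assumptions.

  ;; Illustrative sketch only.  Integer stores occupy the fictitious unit
  ;; for two cycles; branches use the same unit, so a store followed
  ;; closely by a branch becomes more expensive for the scheduler.
  (define_function_unit "fakeunit" 1 0
    (eq_attr "type" "st")
    2 2)

  (define_function_unit "fakeunit" 1 0
    (eq_attr "type" "ibr")
    1 1)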
The second modification reduces the cost for the compare and the branch instruction by decreasing the corresponding return value in the function alpha_adjust_cost in the file alpha.c. See figure 6.

  if (recog_memoized (dep_insn) >= 0
      && get_attr_type (dep_insn) == TYPE_ICMP
      && recog_memoized (insn) >= 0
      && get_attr_type (insn) == TYPE_IBR)
    return 2;

Fig. 6. Alternative modification of the machine description

After either modification, we recompiled the GCC and ran both benchmarks as before. The overall results are shown in table III. Both modifications worked about equally well, and there was no slowdown on any loop kernel or test program.
VII. Conclusions

We summarize the problems encountered in our analysis of the GNU instruction scheduler: Some typical features of current superscalar technology cannot be modeled adequately in the prescribed format. Neither can the machine description directly express that a processor is superscalar, nor can the conditions for issuing several instructions within a single cycle be completely specified with reasonable effort. There is also no means to force instruction word alignment, if desired.