DISSERTATION

INTER-BLOCK CODE MOTION WITHOUT COPIES

Submitted by
Philip Hamilton Sweany
Department of Computer Science

In partial fulfillment of the requirements
for the degree of Doctor of Philosophy
Colorado State University
Fort Collins, Colorado
Fall 1992

ABSTRACT OF DISSERTATION

INTER-BLOCK CODE MOTION WITHOUT COPIES

Code motion is an important optimization for any compiler, and the necessity to include instruction scheduling in compilers for instruction-level-parallel (ILP) architectures makes code motion even more important in compilers for such architectures. Currently popular global scheduling techniques such as trace scheduling allow inter-block code motion during scheduling, but require compensation copies which may make them less useful for many popular ILP architectures. This work measures the amount of inter-block code motion possible when compensation copies are not allowed and develops a global scheduling technique, dominator-path scheduling, which relies on such inter-block motion without copies. Tests show that dominator-path scheduling outperforms trace scheduling for a test suite of C programs compiled for the IBM RISC System/6000, a popular superscalar computer.

Philip H. Sweany
Department of Computer Science
Colorado State University
Fort Collins, Colorado 80523
Fall 1992

CONTENTS

1 Introduction
  1.1 ILP Architectures
  1.2 Compiler Terminology
  1.3 Research Goals
  1.4 ROCKET
    1.4.1 The Compiler
    1.4.2 Who Contributed to What
  1.5 Thesis Organization

2 Code Motion
  2.1 Analysis Necessary for Code Motion
  2.2 Loop Invariant Code Motion
    2.2.1 Basic Algorithm
    2.2.2 Improvements on the Basic Algorithm
    2.2.3 Loop Invariant Code and ILP Architectures
  2.3 Instruction Scheduling Optimizations
    2.3.1 Local Instruction Scheduling
    2.3.2 Trace Scheduling
    2.3.3 Percolation Scheduling
    2.3.4 Loop Optimization
  2.4 Register Assignment and Code Motion
    2.4.1 Register Assignment
    2.4.2 When to Perform Register Assignment
    2.4.3 Tradeoffs

3 Code Motion Without Copies
  3.1 Dominator-Based Code Motion
  3.2 Reif and Tarjan's Algorithm
  3.3 Dominator Analysis and Dominator Motion
  3.4 The Algorithms
    3.4.1 RTEB Algorithm
    3.4.2 Dominator Analysis Algorithm
    3.4.3 Ancillary Algorithms

4 Measurement of Inter-Block Code Motion
  4.1 Test Programs
  4.2 Experimental Results
  4.3 Code Motion and Register Assignment
  4.4 Summary

5 ROCKET Code Generation
  5.1 Code Generation Phases
  5.2 ROCKET Data Dependency DAGs
  5.3 Machine-Independent Code Selection Algorithm
  5.4 DDD Coupler/Decoupler
    5.4.1 Coupler
    5.4.2 Decoupler
  5.5 ROCKET Trace Scheduling
    5.5.1 Pair-Wise Trace Scheduling

6 Dominator-Path Scheduling
  6.1 Dominator-Path Scheduling Overview
  6.2 Dominator-Path Scheduling Algorithm
  6.3 Instruction Scheduling Considerations
    6.3.1 List Scheduling
    6.3.2 Transient Timing Problems
    6.3.3 Scheduling DDDs with Control Flow
  6.4 Summary

7 A Comparison of Dominator-Path Scheduling and Trace Scheduling
  7.1 Choosing Dominator Paths and Traces
    7.1.1 Choosing Dominator Paths
    7.1.2 Choosing Traces
  7.2 Evaluating Schedules
    7.2.1 Modeling the RS6000
    7.2.2 Evaluating Local Schedules
    7.2.3 Evaluating Dominator-Path Schedules
    7.2.4 Evaluating Trace Schedules
  7.3 Experimental Results
  7.4 Conclusions

8 Conclusions
  8.1 Contributions
  8.2 Future Work
  8.3 Concluding Remarks

9 REFERENCES


LIST OF FIGURES

1.1  Eight Queens Source Code
1.2  Eight Queens Control Flow Graph
2.1  Marking Invariant Code
2.2  Conditions to Move Invariant Code
2.3  Possible Placements for Register Assignment
3.1  Quicksort C Source Code
3.2  Quicksort Control Flow Graph
3.3  Dominator Motion for an Intermediate Statement
3.4  Reif and Tarjan's IDef Computation
3.5  Dominator Analysis
3.6  Initialization Algorithm
3.7  Evaldef Algorithm
3.8  Compress Algorithm
3.9  Inverse Dominator Path Algorithm
4.1  Code Hoisting Algorithm
5.1  Phases of the ROCKET Compiler
5.2  Sample Definition-Use Tracks
5.3  Coupler Algorithm
5.4  A variable is used-before-defined in Bottom and live out in Top
5.5  A variable has no track in Top
5.6  The variable is not used-before-defined in ...
5.7  Decoupler Algorithm
5.8  Both the definition and all uses of a track are in "new-list"
5.9  The definition and some of the uses are in "new-list"
5.10 Only some uses of a track are in "new-list"
6.1  Dominator and Reverse Dominator Trees for Eight Queens
6.2  Transition DDDs
6.3  Initialize Transition DDD Algorithm
6.4  Dominator-Path Scheduling Algorithm
7.1  Choosing Dominator Paths
7.2  Choosing Traces

LIST OF TABLES

3.1  Dominators and IDefs for Quicksort
4.1  Test Programs
4.2  Code Motion Possible -- Register Alu
4.3  Code Motion Possible -- Mixed Alu
4.4  Code Motion Possible -- Orthogonal Alu
4.5  Code Motion Possible -- Multiple Passes
4.6  Motion Possible -- Early vs. Late Register Assignment
4.7  Registers Required for Code Motion
7.1  Path Sizes
7.2  On-Path Improvement After Register Assignment
7.3  Global Improvement After Register Assignment
7.4  On-Path Improvement Before Register Assignment
7.5  Global Improvement Before Register Assignment
7.6  Registers Required After Scheduling

Chapter 1

INTRODUCTION

Many of today's computer applications require computation power which cannot be easily obtained using conventional computer architectures which exhibit little instruction-level parallelism and simple instruction timing. Because of a need for more computation power, several varieties of parallel architectures continue to be investigated. One promising alternative to simple processors is building computers that provide greater instruction-level parallelism, allowing more computation during each machine cycle. Such instruction-level parallel (ILP) architectures allow parallel computation of the lowest level machine operations such as memory loads and stores, integer additions, and floating point multiplications within a single instruction sequence. This is a fundamentally different form of parallelism from that exhibited by what Flynn [Fly66] has called MIMD (multiple-instruction, multiple-data) computers. In MIMD architectures (such as the Sequent Symmetry [Seq87] and the Alliant FX28 [All91]) large sections of code are run in parallel on independent processors, each with its own instruction stream and program counter. ILP architectures, in contrast, contain multiple functional units and/or pipelined functional units but have a single program counter and operate on a single instruction stream.

To effectively use parallel hardware, compilers must identify the appropriate level of parallelism. For MIMD architectures, this requires identifying large sections of code that can execute independently and simultaneously. For ILP architectures, effective hardware usage requires that the single instruction stream be ordered such that, whenever possible, multiple low-level operations can be in execution simultaneously. This ordering of machine operations to effectively use an ILP architecture's increased parallelism, typically called instruction scheduling, is a form of code motion not typically found in compilers for non-ILP architectures.

At this research effort's root is a desire to build a retargetable compiler that translates a high-level, imperative language into excellent code for architectures that require instruction scheduling to make effective use of instruction-level parallelism.

1.1 ILP Architectures

A key feature of ILP architectures is their ability to execute multiple machine operations during a single machine execution cycle. An architecture can provide this parallelism in several ways. These include:

LIW. Long-Instruction-Word (LIW) computers are characterized by a wide instruction word which includes separate fields for each of several operations which can be initiated during an instruction execution. Instruction scheduling for LIW architectures consists of encoding the instruction fields to specify the machine operations to be performed during the execution of that instruction. During program execution a single instruction is executed for each machine cycle. To account for delays that occur when no new machine operation can be executed, NOPs are added to the instruction stream. Examples of LIW computers include Pixar's Chap [LP84] and the ESIG-1000 from Evans and Sutherland [Eva89].

Superscalar. These computers add instruction-level parallelism by retaining a simple (and short) instruction format but allowing multiple instructions to begin execution during a single execution cycle. Typically, superscalar architectures include additional hardware to determine, at run time, whether two or more instructions can be executed simultaneously. Instruction scheduling for superscalar architectures involves ordering simple instructions to reduce dependencies between adjacent instructions and thus maximize the hardware's ability to execute multiple instructions simultaneously. Examples of superscalar computers include the IBM RISC System/6000 (RS6000) [BGM90], the Intel i860 [Int90], and the Intergraph Clipper [Pat90].

VLIW. Very-Long-Instruction-Word (VLIW) computers differ from LIW in kind as well as size. Their long instruction word (on the order of 1000-2000 bits for current machines) is a result of replicated simple processors whose instruction words are concatenated into a single instruction word. As in LIW computers, instruction scheduling consists of specifying as many fields as possible for each instruction, and a single instruction is executed during each machine cycle. The Multiflow TRACE series of computers [CNO+88] are VLIW computers, as are computers being built at IBM, described in [EN89].

SuperPipelined. These computers are defined by Jouppi and Wall [JW89] to be computers in which the machine cycle time is shorter than the latency of any functional unit. In such machines, multiple operations are overlapped because different operations can be at different stages of computation during the same cycle. Instruction scheduling for superpipelined architectures consists of ordering operations to minimize delays occurring when a needed operand is not yet available. The Cray architectures are examples of superpipelined machines.

Even though superpipelined machines are considered a separate class of ILP architecture, pipelining itself has been a common feature of architectures for a long time and is included in all high-performance modern computers. Superscalar architectures like the RS6000 and i860 include pipelines for floating point operations, as do VLIW and LIW machines. When considering real architectures, a compiler needs to combine scheduling of pipelines with whatever other features provide for instruction-level parallelism. A more complete discussion of the different forms of ILP architectures can be found in Hennessy and Patterson's architecture text [HP90].


1.2 Compiler Terminology

Since the following compiler terms will be used throughout the remainder of this document, some definitions are in order:

Machine Operations are tasks performed by the computer at execution time. Examples of machine operations are arithmetic computation, data copying from one memory location to another, and data movement in a pipeline.

Instructions are a specification of a group of machine operations which begin execution during one execute phase of the fetch/execute cycle. For LIW and VLIW architectures this is equivalent to a single assembly language instruction containing slots (or fields) in which machine operations are explicitly specified. An instruction for a superscalar architecture, in contrast, is a group of machine operations which may be included in several assembly language instructions but which can all begin execution during the same execution cycle.

Instruction Scheduling is the process of ordering machine operations to reduce the machine cycles necessary to execute a function.

Basic Blocks are straight-line sequences of code (either intermediate code or scheduled instructions) which have a single entrance and a single exit. If execution enters a basic block, all the operations specified in the block will be executed.

Control Flow Graphs are representations of the control paths a function's execution may take. The control flow graph's vertices are basic blocks, and the edges represent a possible control path from one basic block to another. To illustrate this concept, Figure 1.1 shows the source code for the function "queens" (one of three functions of an eight queens program), and Figure 1.2 shows the corresponding control flow graph, with slightly amended C source code listed for each basic block. (The C source code in the basic blocks is unaltered except for additional goto statements to represent the flow of control.)

Local Optimization is optimization restricted to a single basic block. Local optimization requires no knowledge of the contents or even existence of other blocks in the control flow graph.

Global Optimization is optimization performed within the context of more than a single basic block.

(An 8 queens program finds 8 squares of a chessboard such that if a queen were placed on each of the 8 squares, no queen could reach a square occupied by any other queen in a single chess move allowed a queen.)

    queens(c)
    {
        int r;

        for (r = 0; r < 8; r++)
            if (rows[r] && up[r-c+7] && down[r+c]) {
                rows[r] = up[r-c+7] = down[r+c] = 0;
                x[c] = r;
                if (c == 7)
                    print();
                else
                    queens(c + 1);
                rows[r] = up[r-c+7] = down[r+c] = 1;
            }
    }

Figure 1.1: Eight Queens Source Code
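To make the basic block and control flow graph definitions above concrete, the following is a minimal C sketch of one possible representation. The type and field names are illustrative assumptions on my part; they are not ROCKET's internal data structures.

    /* One three-address intermediate statement, e.g.  x = y + z         */
    typedef struct Stmt {
        const char  *dst, *src1, *op, *src2;
        struct Stmt *next;                /* next statement in the block */
    } Stmt;

    /* A basic block: straight-line code with a single entrance and exit. */
    typedef struct BasicBlock {
        int                 id;           /* 0 for B0, 1 for B1, ...       */
        Stmt               *first;        /* statements, executed in order */
        struct BasicBlock **succ;         /* possible successors           */
        struct BasicBlock **pred;         /* predecessors                  */
        int                 num_succ, num_pred;
    } BasicBlock;

    /* A control flow graph: the set of blocks plus the entry block. */
    typedef struct {
        BasicBlock **blocks;
        int          num_blocks;
        BasicBlock  *entry;               /* B0 in Figure 1.2              */
    } FlowGraph;

In such a representation, the edges of the control flow graph of Figure 1.2 would appear as the succ and pred arrays of each block.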

1.3 Research Goals

Within the broader aim of building an easily targeted compiler that generates excellent code for a broad spectrum of ILP architectures, this dissertation investigates several important aspects of compilation related to code motion between blocks of a control flow graph. Code motion is an important aspect of an optimizing compiler, as it can lead to more efficient code. For example, movement of loop-invariant code is a "standard" optimization which attempts to move code out of a loop, and, as such, is found in virtually all optimizing compilers, whether for ILP architectures or not.

In addition to more traditional code motion such as loop invariant code hoisting, ILP architectures can benefit from other forms of inter-block code motion. Consider an architecture which has relatively long pipelines. Throughput can be raised if we keep the pipes filled. Often, the ability to keep pipelines filled can be enhanced if the optimizer moves pipeline operations in the control flow graph to group similar pipeline operations together.

Instruction scheduling is another optimization important in compilers for ILP architectures. While local instruction scheduling (limited to a basic block) is an example of intra-block motion, global instruction scheduling requires inter-block motion. Generally, global scheduling is preferred because it can take advantage of added program parallelism available when motion across block boundaries is allowed. Tjaden and Flynn [TF70], for example, found parallelism within a basic block quite limited. Using a test suite of scientific programs, they measured an average parallelism of 1.8 within basic blocks. In similar experiments on scientific programs in which inter-block motion was allowed, Nicolau and Fisher [NF84] found parallelism which ranged from 4 to virtually unlimited, with an average of 90 for the entire test suite.

(The original Figure 1.2 is a drawing of the control flow graph for queens(); only the contents of its basic blocks are reproduced here.)

    B0:  r = 0
    B1:  if r >= 8 goto B10
    B2:  if rows[r] == 0 goto B9
    B3:  if up[r-c+7] == 0 goto B9
    B4:  if down[r+c] == 0 goto B9
    B5:  rows[r] = 0;  up[r-c+7] = 0;  down[r+c] = 0;  x[c] = r;  if c == 7 goto B7
    B6:  queens(c+1)
    B7:  print()
    B8:  rows[r] = 1;  up[r-c+7] = 1;  down[r+c] = 1
    B9:  r = r + 1;  goto B1
    B10: EXIT

Figure 1.2: Eight Queens Control Flow Graph

Two popular global instruction scheduling techniques which rely on inter-block motion of operations are trace scheduling [Fis79, Ell85, CNO+88] and percolation scheduling [Nic85, AN88b, Aik88, EN89]. (Both are discussed in more detail in Chapter 2.) These techniques both allow massive migration of operations from one basic block to another. Since movement of operations can potentially change the function computed, some mechanism is required to ensure that the original program semantics are maintained. Both percolation scheduling and trace scheduling ensure program semantics by including multiple copies of those operations which move between basic blocks. These copies are called compensation code since they compensate for the effects of operations' movement. This ability to include compensation code to preserve the original program semantics removes many restrictions on moving operations among blocks but has the tradeoff of potentially requiring more operations to be executed since some operations must be copied and then executed at multiple places within the program.

The rationale for such code motion with accompanying compensation copies is that, for machines which perform many operations in a single instruction, the added motion will permit better instruction schedules. Thus, more operations will be available to schedule for each instruction and fewer "holes" (unused instruction-level parallelism) will arise in the scheduled code. Such techniques assume that this better scheduling overshadows the negative effects of extra (copied) operations which will be scheduled (hopefully) in existing holes of instructions in the block(s) where they are added.

While some ILP architectures (e.g., VLIW computers) have enough instruction-level parallelism to make code motion with compensation copies attractive, certainly not all do. For most current superscalar and superpipelined machines, compensation copies might lead to slower execution times unless we restrict operation movement to reduce the required compensation copies. Research using trace scheduling and percolation scheduling has shown that a large amount of code motion is possible between blocks of the control flow graph if we allow semantic-preserving compensation copies. The overall goal of this research is to investigate the code motion possible between basic blocks when not allowing compensation copies. If sufficient code motion without copies is available, perhaps the benefits of global scheduling techniques can be derived without paying the penalty of scheduling some operations multiple times. Thus, the three major subgoals in this investigation of inter-block code motion without compensation copies are:

1. To construct an algorithm which determines the set of blocks to which an intermediate statement can be moved while still preserving program semantics and while not requiring multiple copies of the statement.

2. Given such an algorithm, to measure the amount of code motion available within a test suite of C programs.

3. To construct an algorithm to perform global instruction scheduling but without requiring semantic-preserving copying of machine operations, and to compare that algorithm to a popular global scheduling algorithm which does require such copies.

1.4 ROCKET

ROCKET is an offshoot of the Horizon compiler [MDSW88], also developed at Colorado State University. Like Horizon, ROCKET focuses on machine resource usage as the primary issue in both retargetability and production of highly-optimized assembly code. ROCKET targets a wide variety of ILP architectures assumed to have a single control store and synchronous execution.

1.4.1 The Compiler

To translate C into highly-optimized code for ILP architectures, ROCKET first produces an abstract representation of the input C program and then performs: 1) global optimization, which massages the intermediate representation to improve expected program speed; 2) code selection, which replaces abstract representations of C statements with collections of machine operations; 3) parallelization, which determines resource dependencies and timing; 4) local instruction scheduling, which assigns machine operations to (hopefully a minimum number of) instructions to satisfy data-dependency and machine-resource constraints; and 5) register assignment, which replaces symbolic references with specific machine register references. ROCKET's global optimization includes common subexpression elimination, copy propagation, constant folding, constant propagation, algebraic simplification, and reduction in strength. Aho et al. [ASU86] describe these "traditional" compiler optimizations.

1.4.2 Who Contributed to What

In any large programming effort, it is difficult to accurately assign credit to all who have contributed, and even harder to determine exactly whose ideas were used where. Still, I must make the attempt. Since the ROCKET compiler is in some sense a descendant of Horizon, I will take a historical perspective. In the original Horizon compiler (which itself grew out of Robert Mueller's dissertation [Mue80] and subsequent work by Mueller and Joseph Varghese [MV85, MD86]), Vicki Allan was primarily responsible for instruction scheduling and performed much of the preliminary register assignment work. Michael Duda did virtually all the front end work. Steven Beaty redid and fine-tuned the register assignment algorithms, and I developed most of the program analysis, such as determination of symbolic covers and memory reference disambiguation. I also implemented most of the "standard" optimizations on intermediate code.

Steve Beaty and I implemented all of the current ROCKET compiler, except for the recursive descent parser, lcc, provided by Dave Hanson [FH91]. Beaty implemented register assignment and local instruction scheduling as well as the machine description interface. I constructed intermediate code from the recursive descent parser we used, and implemented the analysis and optimization of the intermediate code, as well as the code motion algorithms presented in Chapter 3, and both trace scheduling and dominator-path scheduling.

1.5 Thesis Organization

The remainder of this dissertation is divided into seven chapters. Chapter 2 investigates compiler issues relevant to code motion. This includes a discussion of currently used code motion techniques, a discussion of compiler analysis bearing upon the ability to perform code motion, and a discussion of how code motion and register assignment affect each other.

Together, Chapters 3 and 4 describe a mechanism to perform inter-block motion (without copies) of machine operations, and a measurement of the amount of such motion possible within C programs. The measurement of motion without copies is important in evaluating such motion's applicability to global instruction scheduling. If a considerable amount of such motion is possible, a global scheduling technique based upon motion without copies might be very effective in scheduling code for ILP architectures. Chapter 3 presents the algorithms to move intermediate statements between blocks (without requiring semantics-preserving compensation copies). Chapter 4 presents measurements of the code motion possible in C programs, using the algorithms defined in Chapter 3.

Chapter 5 provides an overview of ROCKET and investigates in depth those features of ROCKET bearing on code motion. This includes a description of the trace scheduling method used in ROCKET as well as a discussion of the coupler/decoupler which ROCKET uses extensively in both trace scheduling and the new global scheduling technique introduced in Chapter 6.

Chapter 6 introduces a new global scheduling technique, dominator-path scheduling, which is based upon inter-block code motion without compensation copies. The motivation to build such a scheduler is based on a desire to make better use of limited hardware than is possible with global scheduling techniques which require compensation copies. The amount of code motion discovered by the experiments described in Chapter 4 suggests that such a scheduling algorithm might be quite useful for many ILP architectures. Chapter 7 provides experimental evidence to compare dominator-path scheduling with trace scheduling for a popular superscalar architecture, the RS6000. Chapter 8 summarizes the contributions of this work and describes some areas for future research.

Chapter 2

CODE MOTION

Historically, enormous effort has been directed toward refining compilation techniques for diverse classes of computer architectures. This has resulted in a rich body of code-transformation methods whose overall goal is to improve the efficiency of generated code. Catalogues of such code-improvement transformations are widespread in the literature [AC72, ASU86, FL91]. This chapter reviews issues involved in one such transformation, namely code motion.

Code motion is a transformation which reorders code in an attempt to reduce the execution time required by the compilation product. Traditionally, code motion has been limited to hoisting loop-invariant code out of loops to (hopefully) less-frequently executed portions of the control flow graph, leading to more efficient code. When compiling for ILP architectures another form of code motion becomes critical, namely instruction scheduling, which reorders code to better use the target machine's instruction-level parallelism. Within the context of investigating code motion opportunities in compilers for ILP architectures, this chapter surveys code motion techniques currently in widespread use.

To perform code motion (or any other optimization), a compiler must extensively analyze some form of the source program. Section 2.1 reviews widely used program flow analysis techniques. Section 2.2 describes techniques used for loop-invariant code motion. Section 2.3 describes instruction-scheduling optimizations. Finally, since the relative placement of register assignment and code motion has a large impact upon the ability of each to produce excellent code, Section 2.4 investigates the issues involved in the placement of register assignment in the compilation process.

2.1 Analysis Necessary for Code Motion

Almost all optimizations applicable to ILP compilers, whether traditional optimizations such as loop-invariant code motion or non-traditional optimizations which extract parallelism such as instruction scheduling, rely heavily on extensive program analysis. Providing more information to optimization routines generally results in more opportunities for safe program transformations. This section briefly surveys analysis techniques which have been shown to be important to generating excellent code. While these analysis techniques were developed either for traditional non-ILP architectures or MIMD machines, they each have a significant impact on the quality of code generated by such ILP compiler optimizations as instruction scheduling.

Most code transformations (and, indeed, most analysis methods discussed here) require live-variable analysis, which determines the program variables containing values in the basic blocks of a function's control flow graph. This determination is usually broken into two steps, local live analysis and global live analysis. Local liveness analysis identifies (for each basic block) those program variables referenced and those defined within the block. Using local live information, global live analysis determines which program variables contain values upon entry to (live-in variables) and which contain active values on exit from (live-out variables) each block. There exist several well-known methods of performing live-variable analysis [Hec77, MJ81, ASU86].

Extensions of live-variable analysis include determination of where each defined value is referenced in the program (reaching definitions) and determination, at each point in a program, of all variables containing an active value (available expressions). Reaching definition and available expression analyses provide information useful to code motion and we shall see them used in algorithms described in Section 2.2.1. (See Aho et al. [ASU86], pages 610-630, for algorithms to compute reaching definitions and available expressions.) Since both reaching definitions and available expressions are computed using iterative algorithms, they require computation time which is quadratic in the size of the control flow graph.

Reif and Tarjan's method of symbolic program analysis [RT81] provides information similar to reaching definitions and available expressions, but requires less computation time. They base their work on a fast algorithm to compute dominators for a program control flow graph [LT79]. A basic block, A, in a control flow graph dominates another block, B, if any path from the initial block of the graph through B must pass through A. Using the fast dominator algorithm, Reif and Tarjan compute origins for each variable referenced in a program (a variable reference's origin is the basic block which defines the value referenced) in time almost linear in the number of source program statements. Using the origin computation, Reif and Tarjan show how a global symbolic execution DAG (Directed Acyclic Graph) can be built in linear time. This global symbolic execution DAG gives a global representation of all the program's expressions. The information resembles that obtained by reaching definitions and available expressions and, thus, can be used to perform optimizations, including code motion. The advantage of Reif and Tarjan's analysis method is that it is faster than reaching-definition and available-expression analyses.

Another analysis important for generating good scheduled code is accurate determination of data dependencies among a program's operations. Instruction scheduling typically involves building a data dependency DAG (DDD) for each basic block in a program's control flow graph. Since a DDD's edges (representing dependencies between operations) inhibit parallelism, we wish to build the graph with as few edges as possible while maintaining program semantics. The difficulty is determining exactly which data dependencies are necessary to ensure a correct program.

When, after available analysis, a compiler cannot determine whether two references access the same memory location, it generally must make the conservative assumption that they might. To see how this conservative assumption can limit code motion, consider the following example code segment:

    x = y + z
    a = b + c

If a compiler cannot determine that x represents a different memory location than either b or c, it cannot move the second statement above the first; i.e., a data dependency exists between the two statements. Because conservative data dependency analysis has a potentially devastating effect upon optimization opportunities, considerable research effort has examined analysis techniques that determine whether two operands might access the same memory location. These methods are called memory reference disambiguation.

While memory reference disambiguation of the example above may appear straightforward, use of arrays and pointers increases the problem's difficulty. We do not wish to assume that all references to an array access the same memory location, but the alternative requires substantial analysis information to determine whether two array references access different locations. Methods to disambiguate array references [Nic84, Ell85, Wol82, BC86] usually include a function which accepts some representation of each of two array references and returns one of three answers:

    YES    -- the two references always share the same memory location.
    NO     -- the two references never share the same memory location.
    MAYBE  -- the two references might share the same memory location.

In a sense, either YES or NO proves an acceptable answer. Either a dependency exists or one cannot exist. The troublesome answer, MAYBE, requires a conservative assumption about whether the references represent the same location. As an alternative to the compile-time disambiguation methods discussed here, some researchers have suggested adding hardware to perform disambiguation at run-time [Nic89]. Run-time disambiguation can often resolve ambiguities which we cannot resolve at compile time, but whether the better run-time disambiguation compensates for the added hardware remains an unanswered question.

When source languages include pointer dereferences, disambiguation becomes even more difficult. In this case, the compiler either conservatively assumes that a pointer can point to any memory location or it traces the values that pointers may take. A description of data flow problems caused by pointer dereferences and a possible framework for "solving" such problems are included in [ASU86]. Guarna [Gua88b, Gua88a] gives algorithms which determine, in many cases, exactly those memory locations which a pointer dereference accesses.

Just as array references and pointer dereferences present problems to determining accurate data dependency information, so do procedure calls. Without knowledge of what variables (and memory locations) a procedure uses and defines, the compiler must conservatively assume the worst case: that the procedure uses and defines every possible memory location. Again, this may severely limit optimization opportunities. To overcome this difficulty, Cooper and Kennedy [CK84, CK88] have proposed algorithms which determine interprocedural data flow, so that less conservative assumptions can be made. The benefit of such analysis remains somewhat in doubt, however. Richardson and Ganapathi [RG89] have suggested that, at least for microprocessors, interprocedural data flow analysis does not result in significant code improvement (a 1.57 percent reduction in execution time for a series of programs). Even if this is so, ILP architectures may well provide greater opportunity for improved code using interprocedural data flow analysis since they provide greater potential for parallelism. The major perceived advantage of using interprocedural data flow analysis when compiling for ILP architectures is not that traditional optimizations such as common subexpression elimination will have more effect but that interprocedural analysis is necessary to accurately determine data dependencies, which itself is mandatory for the discovery of potential parallelism.

Another technique which I shall consider a form of analysis is procedure inlining, where the compiler replaces procedure calls, wherever possible, with the called procedure's code. While this technique is well-known as an optimization which attempts to speed a program's execution by eliminating function call overhead, inlining has another potential advantage in that it might enhance other optimizations by making larger "chunks" of a program available for analysis and optimization. Holler's dissertation [Hol91] investigates inlining's effects on the size and speed of executable code for a number of architectures, including at least one superscalar machine (the InterGraph Clipper), and concludes that, contrary to commonly held belief, code explosion due to inlining does not increase paging. Inlining, she reports, can lead to less efficient code at times, due not to increased paging but rather to exacerbating register allocation problems.
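To make the three-valued disambiguation interface above concrete, here is a deliberately simplified sketch for one-dimensional array references whose subscripts have the form variable + constant. The ArrayRef representation and the assumption that distinctly declared arrays never overlap are mine, for illustration only; real disambiguators such as those cited above handle far more general subscripts.

    #include <string.h>

    typedef enum { DEP_YES, DEP_NO, DEP_MAYBE } DepAnswer;

    /* A highly simplified array reference: base[var + offset].  A reference
     * such as up[r-c+7] would need a richer subscript form than this.       */
    typedef struct {
        const char *base;     /* array name                           */
        const char *var;      /* single index variable, or NULL       */
        int         offset;   /* constant added to the index variable */
    } ArrayRef;

    /* Might these two references touch the same memory location? */
    DepAnswer disambiguate(const ArrayRef *a, const ArrayRef *b)
    {
        /* Assumption: distinctly declared arrays never overlap. */
        if (strcmp(a->base, b->base) != 0)
            return DEP_NO;

        /* Same array and same index variable: compare the constant offsets. */
        if (a->var != NULL && b->var != NULL && strcmp(a->var, b->var) == 0)
            return (a->offset == b->offset) ? DEP_YES : DEP_NO;

        /* Both subscripts are plain constants: compare them directly. */
        if (a->var == NULL && b->var == NULL)
            return (a->offset == b->offset) ? DEP_YES : DEP_NO;

        /* Different index variables, or one constant and one variable
         * subscript: answer conservatively.                              */
        return DEP_MAYBE;
    }

Anything the analysis cannot prove falls through to MAYBE, which is exactly the conservative behavior discussed in the text.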

2.2 Loop Invariant Code Motion

Loop invariant code hoisting moves code out of loops into less frequently executed parts of a program. It may seem unlikely that typical programs would include many examples of loop-invariant code since we would expect a knowledgeable programmer to include in loops only absolutely necessary computations. However, array addressing often leads to loop-invariant code, particularly after other transformations such as common subexpression elimination and copy propagation are performed. In fact, hoisting of loop-invariant code has proven useful enough to be included in virtually all optimizing compilers. This section describes some of the popular techniques used in implementing such code hoisting and looks at expected benefits of loop-invariant code hoisting for ILP architectures.

2.2.1 Basic Algorithm

Most modern compiler textbooks [ASU86, FL91] give algorithms to hoist loop-invariant code. The basic algorithm discussed here follows that of Aho et al. [ASU86].

    Algorithm Mark_Invariant_Code()
    Input:    A loop, L, consisting of a set of basic blocks
              For each block, B, in L, a sequence of three-address statements
              Reaching definitions for each operand, O, of each statement, S, in L
    Output:   The set, I, of statements invariant in L
    Algorithm:
        I = {}
        Foreach basic block, B, in L
            Foreach statement, S, in B
                If each operand, O, of S is either
                      i) constant, OR
                     ii) all of O's reaching definitions are outside L
                then add S to I
        Changed = TRUE
        While Changed is TRUE
            Changed = FALSE
            Foreach statement, S, in each basic block, B, in L
                If S is not already in I and each operand, O, of S is one of
                      i) constant, OR
                     ii) all of O's reaching definitions are outside L, OR
                    iii) O has a single reaching definition, which is in I
                then add S to I; Changed = TRUE

Figure 2.1: Marking Invariant Code

Hoisting of loop-invariant code requires first identifying, for each program loop, L, all loop-invariant code in L and then moving such code to a new basic block called a loop preheader. Let H be the block which represents the entry into loop L. The loop preheader is a block inserted immediately prior to H in the control flow graph. Its only successor is H. All blocks which originally were predecessors of H are made predecessors of the new loop preheader. Figure 2.1 gives the algorithm for finding invariant code, given a loop, L. Once the invariant statements of a loop are identified, some can be moved to the loop preheader. Specifically, given a generalized 3-address statement, S

    x <- y op z

the compiler can move S to the loop preheader if the conditions listed in Figure 2.2 are met.

    1. S is invariant in L.
    2. The block which contains S dominates all exits from L.
    3. S is the only statement in L which defines x.
    4. No use of x in L is reached by any definition of x other than S.

Figure 2.2: Conditions to Move Invariant Code

Consider Figure 2.2. Conditions 3 and 4 ensure that moving S out of the loop will not cause some use of x to receive a different value for x than it would if S were left in the loop. Specifically, Condition 4 ensures that this is true for those uses of x in the loop, while Condition 3 ensures that no use of x outside the loop will receive the wrong value. Condition 2 is required to ensure that x will only be defined if program flow guarantees that statement S will be executed at least once before execution leaves L. (This test is needed to ensure that x will not be defined if the loop is executed 0 times.)
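As a small constructed example (not taken from the dissertation's test suite), the fragment below shows a statement that satisfies all four conditions of Figure 2.2 and the result of hoisting it into a newly created preheader:

    /* Before: a * b is loop invariant; the statement defining t sits in the
     * body of a do-while, so its block dominates the loop exit (Condition 2),
     * it is the only definition of t (Condition 3), and every use of t is
     * reached only by it (Condition 4).                                      */
    void before_hoisting(int a, int b, int n, int x[])
    {
        int i = 0, t;
        do {
            t = a * b;          /* the invariant statement S */
            x[i] = t + i;
            i++;
        } while (i < n);
    }

    /* After: S has been moved to the newly created loop preheader. */
    void after_hoisting(int a, int b, int n, int x[])
    {
        int i = 0;
        int t = a * b;          /* loop preheader */
        do {
            x[i] = t + i;
            i++;
        } while (i < n);
    }

Had the loop been a top-tested for loop, Condition 2 would fail, since the body might execute zero times; that is exactly the case the condition guards against.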

2.2.2 Improvements on the Basic Algorithm

The basic algorithm for hoisting invariant code out of loops suffers from two major problems. It is overly conservative, missing some hoisting opportunities, and it is inefficient in both time and space since it requires both reaching definitions and multiple passes over all loop statements to identify invariant statements. Thus, considerable research has been directed toward improving the basic algorithm.

The overly conservative nature of the basic algorithm is due to Conditions 2 and 3 of Figure 2.2. Both are meant to ensure that uses of x outside loop L receive the same value for x that they would have without hoisting. Often, however, x will be used only within L itself. In such cases, Conditions 2 and 3 can be ignored. The algorithms discussed in this section take a more general approach to program data flow and allow more hoisting, as appropriate.

The first improvement to the basic algorithm is described in Reif's dissertation and a subsequent paper [Rei77, Rei80]. Reif's method is based upon information computed by symbolic program analysis (the basis of his joint work with Tarjan [RT81]), described briefly in Section 2.1. (This symbolic program analysis is also the foundation for the code motion algorithms developed here and is discussed in detail in Chapter 3.) Reif's method does not move statements at all, but rather it recognizes invariant expressions and moves only those. Thus, for the generalized statement

    x <- y op z

Reif adds the statement

    ti <- y op z

to the loop preheader and replaces the original statement with

    x <- ti

While this still leaves an invariant assignment in the loop, it removes the invariant expression itself. By ignoring definitions, Reif's method hoists expressions which could not be moved using the basic algorithm.

A more significant advantage of Reif's method is speed. Since it is based upon his efficient symbolic program analysis method, Reif's code motion requires time which is almost linear in the size of the control flow graph being analyzed. The actual order statistic for the algorithm is O(m α(m, n) + l), where n is the number of vertices (basic blocks) of the control flow graph, m is the number of arcs of the control flow graph, l is the number of expressions in the program, and α is related to a functional inverse of Ackermann's function, which is an extremely fast-growing function.

Morel and Renvoise describe another improvement on the basic hoisting algorithm [MR79]. The key to their technique is the realization that loop-invariant code hoisting is a special case of a larger problem, that of suppressing partial redundancies. A computation is partially redundant if it is performed more than once in some execution path of a program. Morel and Renvoise's algorithm does not identify loops at all but rather suppresses partial redundancies wherever possible. Their algorithm depends upon maintenance of several bit vectors which include a bit for each expression in the program. Constructing these vectors requires time which is quadratic in the number of basic blocks of a program and requires space which is linear in the number of program expressions. As such, the method is faster than the basic technique, but slower than Reif's method. An extra benefit of the Morel and Renvoise method is that additional optimization is included beyond loop-invariant code hoisting. As with the Reif algorithm, this method hoists only expressions, not entire statements.

Chow [Cho84] uses the Morel and Renvoise method with minor modifications. Chow extends the method to allow hoisting of definitions by treating the assignment operator as a binary operator. This requires some additions to the data flow bit vectors maintained by Morel and Renvoise but does not add to either the time or space complexity of their original method.
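To contrast Reif's expression-only hoisting with the statement hoisting of Section 2.2.1, the constructed fragment below shows a case the basic algorithm must reject (Condition 3 fails because x is also defined on the else path), yet the invariant expression a + b can still be moved to the preheader through the temporary t1. The helper use() and the variable names are illustrative only.

    extern void use(int value);

    /* Before: the expression a + b is invariant, but the statement x = a + b
     * cannot be hoisted because x is also defined on the else path inside
     * the loop (Condition 3 of Figure 2.2 fails).                            */
    void before_reif(int a, int b, int p, int n)
    {
        int x;
        do {
            if (p)
                x = a + b;      /* invariant expression */
            else
                x = 0;
            use(x);
        } while (--n > 0);
    }

    /* After Reif's transformation: only the expression moves to the
     * preheader; the (now trivial) assignment to x stays in the loop.        */
    void after_reif(int a, int b, int p, int n)
    {
        int t1 = a + b;         /* loop preheader */
        int x;
        do {
            if (p)
                x = t1;
            else
                x = 0;
            use(x);
        } while (--n > 0);
    }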

2.2.3 Loop Invariant Code and ILP Architectures

The prospect of executing code fewer times motivates hoisting loop invariant code, based upon the assumption that the loop will execute more than once. For conventional architectures with little concurrency, this provides a sufficient reason to move invariant code outside of loops. Consider an ILP architecture, however. Loop invariant operations might be overlapped with other (loop variant) operations within the loop during instruction scheduling. Thus, within the loop, loop invariant code may well be executed for "free" (not require any additional instructions). When hoisted out of the loop, however, the operations might require one or more extra instructions. Clearly, this would not improve code efficiency but, rather, make it worse. Of course, it is also possible that, by moving the invariant code, the operations could overlap operations outside the loop when concurrency within the loop proves impossible.

Thus, while loop invariant code hoisting almost always improves code efficiency for non-ILP architectures, it may not do so for ILPs. As an architecture's available concurrency increases, the chance of overlapping loop invariant operations increases and the perceived value of loop invariant hoisting decreases. In fact, for machines with a high degree of concurrency, moving code into a loop might prove worthwhile. Such code would, by necessity, be loop invariant. However, it is possible that by overlapping such code with operations of the loop, we could execute the operations with no added loop instructions. Although this "optimization" of code motion into a loop seems counter-intuitive, it could actually result in improved efficiency for some ILP architectures.

2.3 Instruction Scheduling Optimizations

To generate efficient code for an ILP architecture, a compiler must order instructions to exploit low-level, fine-grain parallelism. Four popular instruction reordering methods are local instruction scheduling, trace scheduling, percolation scheduling, and software pipelining. As the name implies, local instruction scheduling attempts to maximize parallelism within each basic block of a function's control flow graph. Both trace scheduling and percolation scheduling present global instruction scheduling techniques, i.e., they generalize local instruction scheduling to exploit parallelism among the basic blocks of a function's control flow graph. Software pipelining exploits parallelism between different iterations of loops.

2.3.1 Local Instruction Scheduling

Local instruction scheduling typically requires two phases. In the first phase, a data dependency DAG (DDD) is constructed for each basic block in the function. DDD nodes represent operations to be scheduled. The DDD's directed edges indicate that a node x preceding a node y constrains x to occur no later than y. More sophisticated systems [Veg82, All86] label edges from x to y with a pair of non-negative integers (min,max) indicating that y can execute no sooner than min cycles after x and no later than max cycles after x. For example, if x placed a value on a bus that y read, an edge from x to y would establish a "data dependency" with timing (0,0) indicating that the read must follow the write in the same instruction. In contrast, if x assigned a value to a register subsequently read by y and the target machine did not permit reading a register after it is written in the same instruction, the edge from x to y would include timing (1,∞), specifying that y must follow x by at least one instruction, but can be placed any number of instructions after x.

Given a DDD, instruction scheduling attempts to order the nodes in the graph in the shortest sequence of instructions, subject to (1) the constraints in the graph, (2) the resource limitations in the machine (i.e., a machine resource can typically hold only a single value at any time), and (3) the field-encoding conflicts that may exist should several operations share a common instruction field. In general, this optimization problem is NP-complete [Rob79]. However, in practice, heuristics can achieve good results. A good survey of early instruction scheduling algorithms occurs in [LDSM80], while [Veg82] and [All86] give more sophisticated algorithms.

Beaty [Bea91] provides an excellent summary of current techniques, evaluating many often-used heuristics, and developing new ones. He also describes promising results applying genetic algorithms (GAs) to the instruction scheduling problem. He found that GA techniques produce schedules comparable to those produced by finely-tuned heuristics, without requiring the exhaustive targeting iterations necessary to fine-tune the heuristics. He also reports that the GA scheduler, more robust than a heuristic scheduler, can better cope with a variety of DDDs which are difficult to schedule.
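The sketch below illustrates the two phases just described: a DDD node with (min,max)-labelled edges and the skeleton of a greedy list scheduler. The structure layout, the priority field, and the helper functions earliest_legal_cycle() and fits_in_instruction() are illustrative assumptions standing in for the timing and machine-resource checks; this is not ROCKET's scheduler.

    typedef struct DDDNode DDDNode;

    /* An edge x -> y carrying timing (min,max): y may begin no sooner
     * than `min` and no later than `max` cycles after x.                */
    typedef struct {
        DDDNode *to;
        int      min, max;
    } DDDEdge;

    struct DDDNode {
        int      op;                 /* which machine operation              */
        DDDEdge *succ;               /* outgoing dependency edges            */
        int      num_succ;
        int      unscheduled_preds;  /* predecessors not yet placed          */
        int      priority;           /* heuristic rank, e.g. critical path   */
        int      cycle;              /* instruction chosen for this node     */
    };

    /* Helpers standing in for the timing and machine-resource checks. */
    extern int earliest_legal_cycle(const DDDNode *n);
    extern int fits_in_instruction(const DDDNode *n, int cycle);

    /* Greedy list scheduling of one basic block's DDD.  `ready` initially
     * holds the nodes with no predecessors and is assumed large enough to
     * hold every node of the block.                                        */
    void list_schedule(DDDNode **ready, int num_ready)
    {
        while (num_ready > 0) {
            /* 1. pick the ready node the heuristic ranks highest */
            int best = 0;
            for (int i = 1; i < num_ready; i++)
                if (ready[i]->priority > ready[best]->priority)
                    best = i;
            DDDNode *n = ready[best];

            /* 2. place it in the first instruction satisfying the DDD's
             *    min timings and the machine's resource limits (a full
             *    scheduler would also honor the max bounds)               */
            int cycle = earliest_legal_cycle(n);
            while (!fits_in_instruction(n, cycle))
                cycle++;
            n->cycle = cycle;

            /* 3. retire n and release any successors that become ready */
            ready[best] = ready[--num_ready];
            for (int i = 0; i < n->num_succ; i++)
                if (--n->succ[i].to->unscheduled_preds == 0)
                    ready[num_ready++] = n->succ[i].to;
        }
    }

The heuristics mentioned in the text enter only through how priority is computed; a node's critical-path length to the end of the DDD is one common choice.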

2.3.2 Trace Scheduling

While local instruction scheduling can find parallelism within a basic block, it can do nothing to exploit parallelism between basic blocks. Both trace scheduling and percolation scheduling (as global instruction scheduling techniques), however, exploit such inter-block parallelism. Trace scheduling optimizes the most frequently used program paths at the expense of less frequently used paths. Fisher originally proposed using trace scheduling to improve microcode efficiency by selectively moving operations around basic block boundaries [Fis81]. Subsequently, Fisher designed a VLIW machine (the Multiflow Trace, a trademarked name of Multiflow Inc.) that relies heavily on the efficiency of its trace scheduler for high performance. In trace scheduling, the path to optimize is known as the trace, leading to the terms on-trace (program control paths to optimize) and off-trace (blocks in less frequented program paths). Trace scheduling's basic plan moves code from one block to another to reduce the number of instructions in the on-trace path, possibly at the expense of increasing the number of instructions in the program's off-trace segments.

Three principal steps occur in trace scheduling: trace selection, trace instruction scheduling, and bookkeeping. First, trace selection identifies frequently-executed paths. The compiler, using data derived by program analysis, can automatically perform some trace selection. For example, program dominators indicate looping structure, and we can safely assume that the statements in the innermost loop in a nested loop structure execute at least as often as those in an outer surrounding loop. Where we cannot automatically determine path execution frequency, other means of selecting traces exist. Another automated mechanism uses a simulator tool to provide profiling data on a carefully chosen set of benchmark programs. This data, provided to the trace selector, tunes the estimates of relative path-execution frequency. Still another trace selection method allows the programmer to explicitly identify traces. Of course, one can mix these methods as appropriate.

The second major phase of a trace scheduling optimization is trace instruction scheduling, which treats the entire sequence of basic blocks on a trace as a single block, using local instruction scheduling to schedule the trace as a single unit. Scheduling multiple blocks on a trace is somewhat more complicated than scheduling a single basic block. To ensure that scheduling preserves the original program semantics, trace scheduling restricts movement of some operations from one basic block to another. Ellis [Ell85] gives an excellent summary of the restrictions necessary to prevent "illegal" transformations.

Bookkeeping, the third major phase of trace scheduling, compensates for movement of operations across basic block boundaries, thereby preserving semantics. Such compensation is necessary because, when an operation, O, moves from one basic block, B, to another (on-trace) block during the multi-block instruction scheduling of phase two, there needs to be some guarantee that O's effects will remain constant on all paths to and from B. This compensation requires that operations moving from one on-trace block to another be copied to an off-trace block as well. While bookkeeping's compensation copies will increase the code size required for a program, the assumption is that program execution time will decrease because the off-trace blocks where compensation code has been placed execute relatively infrequently. Again, Ellis provides an excellent summary of bookkeeping transformations required to ensure semantic preservation during trace scheduling.

A major limitation of Fisher's initial trace scheduling proposal is the potential for copying an exponential number of operations (during bookkeeping) to preserve program semantics. Several approaches to limit copying have been proposed [Lin83, LA83, SDJ84, HMS87]. Further, while Ellis has shown that, in practice, trace scheduling improves execution times when scientific programs are compiled for VLIW architectures, it remains to be seen whether trace scheduling can be as effective for non-scientific code where branch frequencies are typically less dramatic and more input-dependent. Also in doubt is trace scheduling's ability to overcome the effect of compensation copies when compiling for ILP architectures with more limited parallelism than is available in the VLIW model.
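As an illustration of trace selection only (the first of the three phases), here is a sketch of the usual greedy strategy: grow a trace forward from a seed block by repeatedly following the most probable successor edge that does not lead to a block already on some trace. The TBlock type, its frequency and probability fields, and the two-successor limit are simplifying assumptions of mine, not Fisher's or ROCKET's implementation.

    /* Illustrative per-block summary used only for trace selection. */
    typedef struct TBlock {
        double         freq;          /* estimated execution frequency   */
        int            on_trace;      /* already assigned to some trace? */
        struct TBlock *succ[2];       /* successors (NULL-padded)        */
        double         succ_prob[2];  /* estimated branch probabilities  */
    } TBlock;

    /* Grow one trace forward from `seed`, repeatedly following the most
     * probable successor not yet on any trace.  Returns the trace length. */
    int grow_trace(TBlock *seed, TBlock **trace, int max_len)
    {
        int     len = 0;
        TBlock *b   = seed;

        while (b != NULL && !b->on_trace && len < max_len) {
            trace[len++] = b;
            b->on_trace  = 1;

            /* choose the more probable successor edge, if any */
            TBlock *next = NULL;
            double  best = 0.0;
            for (int i = 0; i < 2; i++)
                if (b->succ[i] != NULL && b->succ_prob[i] > best) {
                    best = b->succ_prob[i];
                    next = b->succ[i];
                }
            b = next;
        }
        return len;
    }

A complete selector would also grow the trace backward from the seed toward likely predecessors and would stop at loop back edges; both details are omitted from this sketch.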

2.3.3 Percolation Scheduling

Percolation scheduling [Nic85, AN88b], another global instruction scheduling method, is based on a set of core transformations applicable to a program "graph." This program graph resembles the traditional concept (used throughout this document) of a control flow graph in that the edges represent possible control paths a program might take during execution, but differs in that each node represents not an arbitrary straight-line sequence of code, but rather a set of statements which can be executed in parallel. Like trace scheduling, percolation scheduling requires compensation copies of operations which are moved by the core transformations. It is important to note that percolation scheduling's core transformations represent all "legal" code motions defined on a program graph. To use percolation scheduling, supplied heuristics indicate when the compiler applies core transformations. One such heuristic, suggested by Aiken and Nicolau [AN88b], moves all code as "high" as possible in the program graph, that is, as close as possible to the graph node which represents the entry to the function being compiled.

Unlike trace scheduling, percolation scheduling, as originally defined, depends upon simplifying assumptions about the target architecture. It assumes that each instruction executes in a single machine cycle. This assumption seems well-suited to the Multiflow Trace but is violated by many ILP architectures, limiting percolation scheduling's value for ILP compilers. Also, early percolation scheduling work assumed that no resource conflicts occurred in instruction scheduling; essentially, the architecture had unlimited resources. Any real architecture violates this assumption. Subsequent work by Nicolau and Ebcioglu [EN89] modifies the original percolation scheduling algorithm to provide for resource constraints. In the updated method, for an architecture with the ability to execute k operations in parallel, percolation scheduling chooses the k "most important" operations to schedule in any instruction. According to Aiken and Nicolau, however, supporting more complex timing "poses a number of design problems," including changing the core transformations' definitions. Thus, percolation scheduling does not seem well-suited to those ILP architectures with complex timing.
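As a rough picture of that resource-constrained selection, the sketch below chooses one instruction's worth of operations greedily. It is a generic illustration under invented data structures (the Op type and its priority and ready fields are assumptions), not Ebcioglu and Nicolau's actual algorithm.

    #include <stddef.h>

    #define K 4                      /* machine can issue K operations per instruction */

    typedef struct {
        int id;
        int priority;                /* e.g., critical-path height                  */
        int ready;                   /* nonzero when data dependences are satisfied */
        int scheduled;
    } Op;

    /* Fill instr[] with up to K of the highest-priority ready operations.
       Returns the number of operations selected for this instruction. */
    size_t select_instruction(Op *ops, size_t nops, Op *instr[K]) {
        size_t chosen = 0;
        while (chosen < K) {
            Op *best = NULL;
            for (size_t i = 0; i < nops; i++) {
                Op *o = &ops[i];
                if (o->ready && !o->scheduled &&
                    (best == NULL || o->priority > best->priority))
                    best = o;
            }
            if (best == NULL)        /* no more ready, unscheduled operations */
                break;
            best->scheduled = 1;
            instr[chosen++] = best;
        }
        return chosen;
    }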

2.3.4 Loop Optimization

While trace scheduling and percolation scheduling (as originally defined) in conjunction with local instruction scheduling extract parallelism along a function's control flow path, neither is well-suited to finding potential parallelism within loop constructs. In contrast, software pipelining is a technique motivated strictly by a desire to optimize loops. To achieve high performance in a compute engine, it is often advantageous to employ hardware units that produce results in stages. The multi-stage approach, called pipelining [Kog77, Kog81], pays off when we can keep all stages operating concurrently. In similar fashion, software pipelining overlaps a loop body's different iterations. This overlapping can be particularly effective for architectures which include multi-stage pipes. To visualize the opportunities afforded by a pipelined hardware unit, consider a 3-stage functional unit. We need 3n time units to perform n operations when individual operation execution is not overlapped (i.e., only one pipe stage is active at any time). In contrast, with maximal overlap, we need only n + 2 time units to perform n operations. This scenario proves typical in software applications targeted to machines containing pipes. That is, one evaluates an expression (or perhaps many expressions) in the loop's body. Most compilers simply translate the loop body as a basic block, creating non-overlapping pipe segments. In contrast, a smart loop optimizer can analytically determine the maximal number of iterations that can be overlapped. These iterations are then overlapped, or folded, to effectively begin executing operations that would otherwise execute in subsequent loop iterations. Thus, during execution of any loop iteration, operations from several loop iterations execute in parallel. Software pipelining of loops was first reported by Charlesworth [Cha81], who described an algorithm used in generating hand-written assembly code for the Floating Point Systems AP-120B family of array processors. Since then, software pipelining in compilers has generally dealt with the inherent complexity of finding an optimal schedule by limiting the program constructs on which software pipelining was

utilized. Touzeau [Tou84] restricts software pipelining to loops of a single Fortran statement. Su et al. defined two algorithms, each with different restrictions: URCR [SDJ84] limits the number of loop iterations which can be overlapped to two; URPR [SDX86] does not restrict the number of overlapped iterations in the loop but insists that the number of loop iterations be known at compile time. In addition, both URCR and URPR apply only to loops consisting of a single basic block. Neither algorithm can pipeline loops containing conditional statements. In contrast, Lam [Lam87, Lam88] removes these restrictions on software pipelining. Using a method she calls hierarchical reduction, Lam provides for pipelining of those innermost loops containing arbitrary block-structured control flow. Hierarchical reduction schedules the blocks of an innermost loop, starting with the innermost control constructs. As each control construct is scheduled, it is replaced with a single node which represents all the resource and data dependency constraints of the entire control construct. This single node is then scheduled along with all other nodes in the surrounding control construct. This hierarchical reduction continues until a single node represents the entire loop. This technique allows Lam to pipeline any block-structured innermost loop (loops which contain only IF-THEN-ELSE constructs). Lam also uses hierarchical reduction to pipeline some outer loops, by reducing the innermost loop to a single node as a first step. However, Lam's hierarchical reduction technique is defined only for control flow graphs which are block-structured, so it cannot always be used directly in compilers for source languages which include goto statements. While the original definition of percolation scheduling did not include software pipelining, two more recent variants do. In [AN88a, Aik88], Aiken and Nicolau describe what they call Perfect Pipelining, in which loops are scheduled, unrolled, pipelined, and searched for an emerging pattern. Like Lam's work, perfect pipelining schedules loops with multiple basic blocks. Nakatani and Ebcioglu [NE90] describe a slightly different technique in their percolation scheduling compiler for a VLIW architecture being built by IBM. In their method, termed enhanced pipeline percolation scheduling, inner loops are pipelined first, then the strongly-connected component of the inner loop is treated as an atomic region when scheduling outer loops. Like hierarchical reduction and perfect pipelining, enhanced pipeline percolation scheduling performs software pipelining on multi-block loops.
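The 3n versus n + 2 arithmetic can be seen in a hand-written sketch of a software-pipelined loop. The example below is invented for illustration (the function scale and the load/multiply/store staging are assumptions, not output of any of the compilers above); it shows a prologue that fills the pipe, a kernel in which three iterations are in flight at once, and an epilogue that drains the pipe.

    /* Pipelined schedule, one kernel cycle per column (iterations i0, i1, ...):
       cycle:    1    2    3    4    5   ...
       load:     i0   i1   i2   i3   i4
       multiply:      i0   i1   i2   i3
       store:              i0   i1   i2
       so n iterations finish in about n + 2 cycles instead of 3n. */
    void scale(int *a, const int *b, int c, int n) {
        if (n < 3) {                    /* too short to overlap */
            for (int i = 0; i < n; i++)
                a[i] = b[i] * c;
            return;
        }
        int ld0  = b[0];                /* prologue: fill the pipe           */
        int ld1  = b[1];
        int mul0 = ld0 * c;
        for (int i = 0; i < n - 2; i++) {
            a[i] = mul0;                /* store for iteration i             */
            mul0 = ld1 * c;             /* multiply for iteration i + 1      */
            ld1  = b[i + 2];            /* load for iteration i + 2          */
        }
        a[n - 2] = mul0;                /* epilogue: drain the pipe          */
        a[n - 1] = ld1 * c;
    }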

2.4 Register Assignment and Code Motion

Assignment of program values to registers is critical for generation of efficient code. Computers typically have a hierarchy of memory resources which includes (at a minimum) two levels: relatively fast registers and much slower RAM. It is as true for ILP architectures as for any other computers that a program's execution time can be cut dramatically by maintaining program values in registers rather than in slower memories. This section addresses the interactions between register assignment and code motion which affect their relative placement in the compilation process. Section 2.4.1 defines the register assignment problem and describes graph-coloring register assignment, a popular solution to the register assignment problem. Section 2.4.2

investigates the issue of when, during compilation, register assignment should be performed. While the discussion here will use graph-coloring register assignment as an example, the issues are the same no matter what register assignment technique is used.

2.4.1 Register Assignment

The optimization of keeping program values in registers as much as possible consists of two (potentially) distinct problems: register allocation, which determines which program values will be placed into a register resource, and register assignment, which maps those program values allocated to a register onto the available machine register set. A popular register assignment technique, used in the ROCKET compiler, is register assignment via graph coloring, commonly attributed to Chaitin [CAC+81, Cha82]. Compared with other, more ad hoc register assignment methods, graph coloring provides a relatively simple, conceptually elegant solution to the problem of mapping program values to a computer's register set. Graph coloring combines allocation and assignment by allocating each distinct scalar program value to a different imaginary (symbolic) register and then mapping the symbolic registers to the physical (hard) registers of the architecture. In this process, graph coloring denotes symbolic registers as graph nodes and places arcs between nodes whose values are live simultaneously. The solution then involves finding an n-coloring of the graph, where n represents the number of the target machine's available hard registers. A graph is considered correctly colored if each node's color differs from each of its neighbors'. As an architectural paradigm, this implies each symbolic register is assigned a hard register different from all other symbolic registers live during the same execution cycles. It is well known that, given a graph, G, and a natural number n > 2, the problem of determining whether G is n-colorable is NP-complete [HS80], but several researchers have reported excellent results using simple heuristics to color the interference graph [CAC+81, Cho90, BCKT89]. Chaitin's register assignment is based upon a simple observation about coloring graphs. If we remove a graph node of degree less than n, no matter how its neighbors are colored, at least one color will be left over for it. For example, consider a graph for which we attempt a four-coloring. If we remove a node of degree three, its neighbors may use at most three colors, leaving at least one color for the removed node. Nodes are removed in this manner until the graph is either empty or no remaining node has degree less than n. Once a symbolic register-interference graph has been shown to be n-colorable, nodes are colored by determining which nodes they interfere with, in inverse order of removal. A hard register (color) not used by any of a node's neighbors is chosen for the node. This deterministic routine consumes non-exponential time and space. It is not guaranteed to find an n-coloring if one exists; however, it does produce excellent results in practice. When a graph is not n-colorable, or rather when the heuristic cannot demonstrate that the graph is n-colorable, some symbolic registers (program values) are

stored in a non-register resource. Using this resource (usually an off-chip read/write memory) incurs a speed penalty, so great care must be taken in choosing which symbolic registers should be so reallocated. Code must be generated to store the symbolic register after each of its definitions and to load it before each use. This reduces register contention by reducing the lifetime of the spilled symbolic registers, thereby reducing the interference caused by the spilled registers and allowing register assignment to completely color the graph. Because the interference may not be reduced enough, and because hard registers are usually still needed temporarily for spilled symbolic registers, a new graph is built and the coloring/spilling cycle repeated as necessary. Symbolic registers are chosen to be spilled based on perceived cost: those within nested loops or with a high number of uses will be spilled last. The register assignment method employed in ROCKET differs only slightly from Chaitin's, as it uses an improvement suggested by Briggs et al. [BCKT89] which leads to fewer spilled registers than the Chaitin method.
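A compact sketch of the simplify-and-select heuristic just described is given below. The adjacency-matrix representation, the fixed MAXSYM bound, and the absence of spill-cost handling are simplifications made for illustration; this is not ROCKET's implementation.

    #include <stdbool.h>

    #define MAXSYM 128

    int  nsym;                        /* number of symbolic registers (nodes) */
    bool interferes[MAXSYM][MAXSYM];  /* interference graph                   */

    static int degree(int v, const bool removed[]) {
        int d = 0;
        for (int w = 0; w < nsym; w++)
            if (w != v && !removed[w] && interferes[v][w])
                d++;
        return d;
    }

    /* Try to map each symbolic register to one of n colors (hard registers).
       Returns true on success; on failure a real allocator would choose a
       node to spill and repeat the build/color cycle. */
    bool color(int n, int color_of[]) {
        bool removed[MAXSYM] = { false };
        int  stack[MAXSYM], top = 0;

        /* Simplify: repeatedly remove a node of degree less than n. */
        for (bool changed = true; changed; ) {
            changed = false;
            for (int v = 0; v < nsym; v++)
                if (!removed[v] && degree(v, removed) < n) {
                    removed[v] = true;
                    stack[top++] = v;
                    changed = true;
                }
        }
        if (top < nsym)
            return false;             /* heuristic blocked: spilling required */

        /* Select: pop nodes and give each a color unused by its neighbors. */
        while (top > 0) {
            int v = stack[--top];
            removed[v] = false;
            bool used[MAXSYM] = { false };
            for (int w = 0; w < nsym; w++)
                if (w != v && !removed[w] && interferes[v][w])
                    used[color_of[w]] = true;
            int c = 0;
            while (used[c]) c++;      /* exists because v had degree < n when removed */
            color_of[v] = c;
        }
        return true;
    }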

2.4.2 When to Perform Register Assignment

Having decided to perform register assignment using graph coloring, the important question is: when during compilation should register assignment be done? There are several possibilities, as shown in Figure 2.3. Proceeding in "chronological" order of the compilation process, the first choice would be to perform register assignment on the machine-independent intermediate form. This placement is often used in register assignment for non-ILP architectures, where the intermediate form is closely related to the final code produced. For ILP architectures, however, where instruction scheduling reorders operations, we shall see that performing register assignment this early in the compilation process leads to difficulties. The next logical place to include register assignment would be during code selection. This approach has long been used in traditional compilers. Graph-coloring register assignment, however, is designed to function as a separate pass of the compiler. This design feature is both a strength and a weakness: the decoupling from other compiler phases makes it more difficult to access information from those phases when making register assignment decisions, but it also makes the algorithm more independent. Since graph coloring is designed to be a separate pass, we made no attempt to integrate it with code selection. ROCKET could perform graph-coloring register assignment on the DDDs generated by code selection. In fact, the Horizon compiler from which ROCKET developed performed graph-coloring register assignment at this time [Bea87]. The advantage over earlier register assignment is that the DDDs expose more of the parallelism inherent in the program than the intermediate form does. Bradlee et al. [BEH91] describe combining register assignment with instruction scheduling, which moves the process even closer to the final form of the compilation process. As with register assignment during code selection, however, integrating graph-coloring register assignment and instruction scheduling can be difficult to accomplish in a machine-independent manner. Yet another solution is to delay register assignment until after

[The original figure shows the compilation pipeline (C source parsed into N-tuples, then traditional global analysis and optimization, code selection producing DDDs, and instruction scheduling producing scheduled instructions) with a register assignment box attached at each of the possible placement points discussed in the text.]

Figure 2.3: Possible Placements for Register Assignment

scheduling is complete, which allows the compiler to use the best possible estimate of the register usage in the code produced. One final solution to where register allocation and assignment should be placed in the compilation process is to delay them past compilation entirely, and perform at least part of the process during program linking. Wall [Wal86] has reported good results with such link-time register assignment. ROCKET supports register assignment at either of two compilation times: early, based upon intermediate statements, or late, after instruction scheduling. This avoids integration of register assignment with another compiler phase (code selection or instruction scheduling) and maintains the machine-independence of register assignment. So, the question remains: where, in the compilation process for an ILP architecture, should register assignment be performed to give the best results? In one sense, we would like register assignment to be done very late. That way, the compiler maintains the myth of unlimited register resources until after optimizations such as code motion, common subexpression elimination, copy propagation, and dead code removal. If an optimization calls for creation of a new register, or expansion of a (register) variable's lifetime, the compiler can, by delaying register assignment, perform the optimization knowing that register assignment will later "make everything right" with respect to the allocation and assignment of needed register values. Similarly, if an optimization (e.g., dead code removal) removes the need for a register, or shortens a register variable's lifetime, we can be sure, by delaying register assignment, that the lowered register interference will be noticed at register assignment time. Thus, the basic rationale for performing register assignment late is that we wish to assign values to hard registers only after any optimizations which may change either the number of register values needed or those values' lifetimes. If the compiler assigns registers before one or more of these optimizations, it bases assignment and spilling decisions on poor estimates of the register usage in the compilation's final product. Delaying register assignment causes problems, however, because no target architecture has an infinite number of registers. By delaying register assignment until after the code motion phase(s) of the compiler, we may add considerably to the program's register interference, due to lengthening of variable livetracks. (A variable livetrack is a representation of where a variable is defined in the control flow graph and where it is used; the length of a livetrack is the distance between the definition of the variable and its last use.) To investigate the interaction between code motion and register assignment in more depth, consider instruction scheduling as a specific example of code motion. While scheduling itself will not create or destroy register values, it will most certainly alter the lifetimes of register values by changing the relative order of operations in the intermediate code. Therefore, to use the most accurate data flow information, we would like to delay register assignment until after the compiler's scheduling phase. As stated previously, scheduling takes a data-dependency graph (DDD) and attempts to order the DDD's operations into as few instructions as possible, subject to the DDD's data

dependencies. When the compiler assigns registers (i.e., maps symbolic registers to hard registers) before scheduling, it must add dependencies associated with the hard registers. Some of these dependencies (specifically the anti-dependencies needed between different values which share a hard register) are in a sense unnecessary, since they do not represent the actual data flow of the program being compiled. Necessary or not, however, they restrict the possible movement of operations during scheduling and thus reduce the likelihood of obtaining the minimal number of instructions. Of course, the better schedules permitted by delaying register assignment until after scheduling will be due in part to (symbolic) register definitions being moved relative to uses of the same register. This could lead to added register interference and require register assignment to use more registers when it is finally performed. If the added interference causes registers to be spilled, the advantageous effects of the code motion will be seriously impaired. In short then, delaying register assignment until after code motion leads to more opportunities for code motion but could lead to costly register spilling.
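The anti-dependencies in question are easy to see in a small invented example (the temporaries t1 through t4 and the hard register r1 are hypothetical, not taken from any program in this work):

    /* With symbolic registers, the two product chains below are independent
       and could be scheduled in the same instruction on a 2-issue machine.
       If register assignment runs first and maps both t1 and t3 to the same
       hard register r1, the definition of t3 (a write of r1) must wait until
       t2 = t1 + c has read r1: an anti-dependence that serializes the chains
       even though no value flows between them. */
    void chains(int a, int b, int c, int d, int *out1, int *out2) {
        int t1 = a * b;      /* symbolic t1 (say, assigned hard register r1) */
        int t2 = t1 + c;     /* last use of t1                               */
        int t3 = c * d;      /* symbolic t3: reuses r1 under early assignment */
        int t4 = t3 + a;
        *out1 = t2;
        *out2 = t4;
    }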

2.4.3 Tradeoffs

To measure the tradeoffs between register assignment before and after instruction scheduling, Sweany and Beaty [SB90, Bea91] compiled the Livermore Loops program [McM86] for a hypothetical machine, a 68020-based engine with an off-chip

floating point adder and an off-chip floating point multiplier. The code generated with early register assignment was compared with that using post-scheduling register assignment. With late register assignment, code improvements of up to 20% were found in some longer loops. Overall, late register assignment resulted in 6% fewer instructions. The authors attribute this to the reduction in anti-dependencies with late register assignment. It should be noted that register usage was greater with late register assignment as well. This is as expected, since instruction scheduling was able to extract more parallelism without the register anti-dependencies present in the DDDs. While the added registers did not require spilling on the hypothetical architecture with 32 integer and 32 floating point registers, the possibility exists that late register assignment could cause spilling in a program where early register assignment would not.

Chapter 3

CODE MOTION WITHOUT COPIES

Since inter-block code motion offers opportunities for improved code on ILP architectures, considerable research effort has been directed toward algorithms which perform inter-block motion. Much of this research has centered around techniques which allow massive inter-block motion of operations but may require multiple copies of moved operations to compensate for the semantic changes inherent in the operations' movement. A major goal of the research described here is the determination of how much inter-block motion is available in C programs when such compensatory copies are not allowed. Since the focus is to quantify inter-block motion for programs, we wish to investigate a level of program abstraction that removes, as much as possible, machine-dependent detail. The intermediate-statement level of program abstraction seems appropriate, since it can use the compiler's program analysis but includes little machine-specific information. Therefore, this chapter investigates intermediate-statement-level code motion possibilities in some detail to develop algorithms which will allow the movement of intermediate statements upward in the control flow graph without requiring compensatory copies. The algorithms developed here will subsequently be used to measure the amount of such motion available in C programs. (See Chapter 4.)

3.1 Dominator-Based Code Motion

The criterion for our code-motion-without-copies algorithm is that it move code from one basic block to another without inserting additional copies to compensate for changes in the program's semantics due to the motion itself. Thus, given a control flow graph (CFG) for the function being compiled, to move an intermediate statement, S, from block B1 to block B2, the algorithm must ensure that:

1. For every possible execution path in the CFG, the value(s) computed by S will be the same if S is moved to B2 as they would be if S remained in B1.

2. For every possible execution path in the CFG, any statement S' which uses value(s) defined by S will receive the same value(s) if S is moved to B2 that it would if S remained in B1.

3. For every possible execution path in the CFG, if block B1 is executed, block B2 will also be executed.

Taken together, the first two restrictions say that if S moves, the algorithm must ensure that S still produces the same result and that any other statement which depends upon S still receives the same values from S. It may seem that, given the first two restrictions, the third is not necessary. However, the third restriction, a necessary consequence of the first two, leads us to an important realization upon which the code motion algorithms to be developed will depend. If, for example, the third restriction were false, there would be at least one execution path, P, which includes B1 but not B2. If S were moved to B2, then S would not be executed if P were taken, and thus any S' depending upon S would not receive the same value it would were S left in B1. Thus, we can say that restriction 3 above is a necessary precondition for code motion without copies. (Actually, this is a slight simplification. If there were no S' using the value(s) computed by S, the algorithm could move S anywhere without changing the program semantics; under such circumstances, however, S would be dead code and could be removed entirely. This discussion assumes that dead code has already been removed.) The need to abide by restriction 3 leads to considering only motion of code from a block, B, to one of B's dominators. A block, D, dominates another block, B, if all paths from the CFG's root to B must pass through D. By restricting code motion to only move code from a block to one of its dominators, we guarantee that restriction 3 above is followed. I call such code motion dominator-based code motion. In their 1981 paper [RT81], John Reif and Robert Tarjan provide a fast algorithm, herein called RTEB (Reif and Tarjan Expression Birthpoints), for determining the approximate birthpoints of expressions in a program's flow graph. Consider a generalized intermediate statement, S, in basic block, B:

A ← E

The birthpoint of expression E is the location in the control flow graph where E can first be computed while guaranteeing that the value computed will be the same as that in the original program. The values computed by RTEB are approximate birthpoints because, rather than finding the exact point (an intermediate statement within a basic block) where the expression is "born," RTEB finds that block, D, such that:

1. D dominates B.

2. The value of E when execution leaves D is guaranteed to be the same as that computed by S.

3. D is that dominator of B highest in the control flow graph for which statement (2) is true. Highest in this context means closest to the root of the dominator tree.

In other words, RTEB's approximation identifies the block where E is "born," but rather than finding the exact statement at which E becomes live, it indicates that E will be live after the execution of the entire block.


The RTEB algorithm is used as a basis for the code motion algorithms developed here because it provides a close approximation to where an expression is born, and because it is very fast, requiring execution time which is almost linear in the size of the CFG being analyzed. The actual order statistic for the algorithm is O(m α(m, n) + l), where α is the slowly growing functional inverse of Ackermann's function. While RTEB identifies approximate birthpoints of expressions, it does not provide enough information to allow safe code motion without copies, because it considers only expressions, not intermediate statements, which, in general, include not only a computation of some expression but also the assignment of the value computed to a program variable. This chapter describes an extension to RTEB, which I call dominator analysis. Dominator analysis computes approximate birthpoints of intermediate statements and is itself used in the dominator motion algorithm, also defined in this chapter. Dominator motion performs the actual code motion without copies.
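Restriction 3 of the previous section amounts to a dominance test between the source and destination blocks. A minimal sketch of that test is shown below; the BasicBlock type with an idom field is an invented representation, not ROCKET's data structure.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct BasicBlock {
        struct BasicBlock *idom;   /* immediate dominator; NULL for the CFG root */
    } BasicBlock;

    /* A block dominates itself; otherwise walk the dominator tree toward the
       root.  Every dominator of b1 lies on this chain, so a statement may move
       from b1 to b2 (restriction 3) only if this test succeeds. */
    bool dominates(const BasicBlock *b2, const BasicBlock *b1) {
        for (const BasicBlock *d = b1; d != NULL; d = d->idom)
            if (d == b2)
                return true;
        return false;
    }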

3.2 Reif and Tarjan's Algorithm

Since dominator analysis and dominator motion, the algorithms defined here, are based upon Reif and Tarjan's method, a discussion of RTEB is in order. Given a control flow graph (CFG) for a function to compile, RTEB defines the following:

Dominator(B): The set of all vertices, D, of the CFG which have the property that all paths from the root of the CFG to B must pass through D. A node may have any number of dominators. By definition, a node dominates itself.

Immediate Dominator(B): Called idom(B) in RTEB, the immediate dominator is that dominator of B which is itself dominated by all other dominators of B. A block has a single immediate dominator.

Maximal Dominator(B,P): That dominator, MD, of B which satisfies predicate P and, of all B's dominators which satisfy P, is the one closest to B in the CFG. Closest in this context means the first such dominator encountered on a traversal of the dominator tree from B back towards the root of the tree.

Immediate Definition Set(B): Called idef(B) in RTEB, the immediate definition set of a basic block includes all program variables defined on any path from idom(B) to B. In other words, any variable which might be defined after leaving idom(B) but prior to entering B is included in idef(B).

Origin(B,V): The origin for a block, B, and variable, V, which is live-in to B is that highest dominator, HD, of B such that the value for V entering B is the same as that leaving HD. (Highest, in this context, is the point closest to the start of the function being compiled.)

RTEB uses Lengauer and Tarjan's fast algorithm for finding the dominators of a CFG [LT79]. RTEB consists of three phases, which:

1. computes idef(B) for each block, B, in the CFG, using the Lengauer and Tarjan fast dominator algorithm, with minor additions to compute the idef sets.

2. computes the origin for each pair <B, V>, where B is a basic block and V is a variable which is live-in to B.

3. builds a directed acyclic graph (DAG) to represent the symbolic value of each expression contained in the program.

To illustrate these concepts, Figure 3.1 shows the C source code for the function "quicksort", and Figure 3.2 shows the control flow graph for the same function, with the C source statements included in each basic block. Table 3.1 shows, for each basic block in the quicksort control flow graph, the immediate dominator of the block and the def and idef sets for the block. Looking at Table 3.1, we see that blocks B1, B2, B5, B6, and B7 all have empty idef sets, because each can be entered only from the immediate dominator of the block itself. Thus, no program variables could possibly be defined on any control flow path after leaving the block's immediate dominator but before control

flow enters the block itself. The situation for blocks B3, B4, and B8 is different, however. Each of these blocks has multiple entries, so the possibility exists that program variables will be defined on some entrance to the block which excludes the block's immediate dominator. Consider block B4, whose immediate dominator is block B3. At first glance, it seems that every entrance to block B4 must pass through block B3, but this is not so. Since B4 is a predecessor of itself (being a single-block loop), B4 may be entered many times after program flow of control leaves its immediate dominator. Thus, any program variable defined in block B4 itself must be included in the idef set for B4, since it can be defined after leaving the immediate dominator and before some entrance to B4. This accounts for j being included in B4's idef set, since it is defined in B4. Similarly, anything defined in B3 must be included in B3's idef set, since B3 forms a single-block loop as well. In addition, though, B3 is also the start of a larger loop which includes blocks B3, B4, B5, and B6. Since program control flow may enter B3 from B6 without going through the immediate dominator of B3, we need to include any program variable defined in any of blocks B3-B6 in B3's idef set. Given def and idef, RTEB computes an origin for each pair <B, X>. The origin for a block, B, and program variable, X, is given by:

    Origin(X, B) = B    if X ∈ idef(B) ∪ def(B)
    Origin(X, B) = D    otherwise,

where D is the maximal proper dominator of B such that X ∈ idef(D) ∪ def(D).


quicksort(m,n)
int m,n;
{
    int i,j,v,x;
    if( m < n ) {
        i = m-1; j = n; v = a[n];
        while( 1 ) {
            do i++; while( a[i] < v );
            do j--; while( a[j] > v );
            if( i >= j ) break;
            x = a[i]; a[i] = a[j]; a[j] = x;
        }
        x = a[i]; a[i] = a[n]; a[n] = x;
        quicksort(m,j);
        quicksort(i+1,n);
    }
}

Figure 3.1: Quicksort C Source Code

B1:  if m > n goto B8
B2:  i = m-1;  j = n;  v = a[n]
B3:  i++;  if a[i] < v goto B3
B4:  j--;  if a[j] > v goto B4
B5:  if i >= j goto B7
B6:  x = a[i];  a[i] = a[j];  a[j] = x
B7:  x = a[i];  a[i] = a[n];  a[n] = x;  quicksort(m,j);  quicksort(i+1,n)
B8:  EXIT

Edges: B1 -> B2, B1 -> B8; B2 -> B3; B3 -> B3, B3 -> B4; B4 -> B4, B4 -> B5; B5 -> B6, B5 -> B7; B6 -> B3; B7 -> B8.

Figure 3.2: Quicksort Control Flow Graph

Block   IDom   Defined    IDef
B1      --     ∅          ∅
B2      B1     {i,j,v}    ∅
B3      B2     {i}        {i,j,a,x}
B4      B3     {j}        {j}
B5      B4     ∅          ∅
B6      B5     {a,x}      ∅
B7      B5     {a,x}      ∅
B8      B1     ∅          {i,j,v,a,x}

Table 3.1: Dominators and IDefs for Quicksort

Knowing the origin for each such pair <B, X>, RTEB computes the birthpoint for each expression computed in a block, B. The birthpoint of any constant term is the CFG's root node, as any constant has a known value at the function's start and that value never changes. The birthpoint of any program variable, X, which is used in some expression in B is the origin, Origin(X, B). Then, the birthpoint for any expression, E, made up of operators and operands O1, ..., On (where each Oi must be either a constant or a program variable) is the maximal dominator of B (the dominator closest to B, or possibly B itself) among the birthpoints of the expression's operands. This birthpoint of an expression represents the earliest position in the dominator chain leading from the root to B where we can safely compute E, knowing that the value computed would be the same as if we performed the computation within B itself. For an example, turn again to Figure 3.2 and Table 3.1, and consider the expression in block B5:

    i >= j

Since neither i nor j is included in either the def or idef set for B5, the birthpoint of the expression must be some dominator of B5. Looking at the birthpoints of the individual terms, we see that i's birthpoint is B3, since neither of B4's def and idef sets includes i, but B3's do. The birthpoint of j, however, is B4, since B4 defines j. Thus, the birthpoint of the entire expression is B4, since that is the maximal dominator of B5 among the operands' birthpoints for the expression in question. Expression birthpoints can be used to identify global common subexpressions. But they do not provide sufficient information to allow motion of intermediate statements from a block to one of its dominators, because the birthpoints only identify possible movement of expressions. They ignore the issue of moving program variable definitions. What they do tell us, however, is where an expression can first be computed. Returning to our generalized intermediate statement,

A ← E

a code motion algorithm based upon RTEB might allow addition of

ti ← E

where ti is a compiler-generated temporary, to the block, D, which is the birthpoint of E. This, in turn, allows conversion of the original statement to:

A ← ti

Since RTEB's birthpoint computation is based on the value a variable will have upon exit from a basic block, for code motion to move E's computation to its birthpoint basic block (as described above) it must ensure that E is computed after the existing computations of the birthpoint block. Thus, in the earlier example, where we determined that the expression i >= j could be moved from block B5 to B4 in quicksort's CFG, we would need to ensure that we evaluated the expression after B4 decremented j. Note also that by moving expressions from a block to one of its dominators, the compiler potentially alters the number of times the expression is evaluated. Indeed, one major purpose for performing such code motion is to hoist code out of loops whenever possible. RTEB ensures that, no matter how many times a moved expression is computed, the expression's value which reaches the original computation point will be the same, thereby preserving program semantics. This is a direct consequence of the algorithm not moving any existing definitions, although it potentially creates new single-definition temporaries. When, in the next section, we generalize RTEB to allow movement of intermediate statements, including program variable definitions, we must ensure that we maintain the original program semantics while moving variable definitions from a block to one of its dominators, which may not be executed the same number of times.
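A concrete instance of this rewriting (an invented C fragment, not one of the dissertation's test programs): the right-hand side x * y is invariant in the loop, so a compiler-generated temporary can hold it at its birthpoint while the original assignment stays in place.

    /* Before the transformation: E = x * y is evaluated on every iteration. */
    void hoist_before(int *a, int n, int x, int y) {
        for (int i = 0; i < n; i++)
            a[i] = x * y;
    }

    /* After the transformation: ti <- E is added at E's birthpoint (a block
       that dominates the loop body), and the original statement becomes
       A <- ti.  Only the expression moved; no existing definition moved. */
    void hoist_after(int *a, int n, int x, int y) {
        int t1 = x * y;            /* ti <- E, placed at E's birthpoint */
        for (int i = 0; i < n; i++)
            a[i] = t1;             /* A <- ti                           */
    }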

3.3 Dominator Analysis and Dominator Motion

RTEB's expression birthpoints are not by themselves sufficient to allow code motion of intermediate statements from a block to one of its dominators. To perform code motion of intermediate statements, the compiler needs to compute approximate birthpoints for intermediate statements. The enhanced RTEB algorithm to compute intermediate-statement approximate birthpoints is dominator analysis, while the actual code motion itself is performed by dominator motion. So, for the generalized intermediate statement,

A ← E

in addition to computing the birthpoint of the right-hand-side expression, dominator analysis must consider any variables being assigned as well. To ensure a "safe" move of an expression, RTEB need only ensure that no expression operand moves above any possible definition of that operand. Dominator motion needs to make a similar requirement for the variable being assigned, but it must do more. As well as not moving A above any previous definition of A, dominator motion must ensure that A does not move above any possible use of A. Otherwise, dominator motion might change A's value at that previous use. To ensure safety, dominator analysis computes a new set, iuse, for each basic block. The immediate use set for a block is defined analogously to idef; that is,


iuse(B) is the immediate use set of a basic block, which includes all program variables used on any path from idom(B) to B. In other words, any variable which might be used after leaving idom(B) but before entering B is included in iuse(B). Given each basic block's iuse set, dominator analysis can compute the birthpoint of any intermediate statement, S, in the program. To determine the proper birthpoint for a statement, S, currently in a basic block, B, dominator analysis requires the following data flow sets:

- For each statement, S, in the intermediate form of the program, live-variable analysis provides Use, the set of program variables used by S (this set includes any operands used in the right-hand side of S), and Def, the set of program variables defined by S (this set is the program variable(s) assigned by S).

- For each basic block, B, in the CFG, dominator analysis needs the def, idef, use, and iuse sets previously defined. Standard live-variable analysis yields def and use, while dominator analysis computes idef and iuse.

In addition, it is convenient to maintain two additional sets for each basic block.

Potential Use (puse):

    puse(B) = use(B) ∪ iuse(B)

Intuitively, puse(B) represents the set of variables that are either used in B or might be used on some path into B that avoids B's immediate dominator.

Potential Def (pdef):

    pdef(B) = def(B) ∪ idef(B)

Intuitively, pdef(B) represents the set of variables that are either defined in B or might be defined on some path into B that avoids B's immediate dominator.

Given these sets, the birthplace for a statement, S, currently in block, B, is:

    BirthPlace(S, B) = B    if use(S) ∩ pdef(B) is non-empty
                            OR def(S) ∩ (pdef(B) ∪ puse(B)) is non-empty
    BirthPlace(S, B) = D    otherwise,

where D is the maximal proper dominator of B such that either use(S) ∩ pdef(D) is non-empty OR def(S) ∩ (pdef(D) ∪ puse(D)) is non-empty.

In other words, dominator motion can move a statement, S, along a path in the dominator tree from B towards the root as long as nothing used in S jumps a possible definition of the same variable and nothing defined in S jumps a possible use or definition of the same variable. By enhancing RTEB to compute birthplaces of intermediate statements instead of expressions, several issues which had no effect upon RTEB become important. As was mentioned previously, when code motion moves intermediate statements (or just expressions) from a block to one of its dominators, it runs the risk that the statement (expression) will be executed a different number of times in the dominator block than it would have been in its original location. Look once again at the quicksort CFG (Figure 3.2) for an example. Assume the expression i >= j is indeed moved from block B5 to B4, as is possible according to the expression's birthpoint. It will potentially be executed many more times in B4 than if left in B5, since B4 has a nesting level of two while B5 is at nesting level one. When expressions are all that we are moving, this is acceptable (although it may not be efficient to move a statement into a loop), since the value needed at the original point of computation will be preserved. As far as program semantics are concerned, it does not matter how many times we compute the same value as long as we get the correct value the last time it is computed. This is guaranteed by RTEB. Consider as well the consequences of moving an expression from a block that is never executed for some particular input data. Again, the fact that we compute a value we never use may not be efficient, but it does not alter program semantics. When dominator motion moves entire statements, however, the issue becomes more complex. If the statement moved assigns a new value to an induction variable, as in the following example,

n = n + 1

dominator motion would change n's final value if it moved the statement to a block

where the execution frequency differed from that of its original block. Dominator motion could alleviate this particular problem by prohibiting motion of any statement for which the use and def sets are not disjoint, but the possibility remains that a statement may define a variable based indirectly upon that variable's previous value. To remedy the more general problem, dominator motion disallows motion of any statement, S, whose def set intersects with those variables which are used-before-defined in the basic block in which S resides. Suppose the optimizer moves an intermediate statement that defines a global variable from a block which may never be executed for some set of input data. Then the optimized version might define a variable the unoptimized function did not, possibly changing program semantics. We can be sure that such motion does not change the semantics of the function being compiled, but there is no mechanism, short of compiling the entire program as a single unit, to ensure that defining a global variable in this function will not change the value used in another. Thus, to conservatively ensure that it does not change program semantics, dominator motion prohibits movement of any statement which defines a global variable. At first glance,

it may seem that this prohibition cripples dominator motion's ability to move any intermediate statements at all, but we shall see that such is not the case. Another issue related to safe motion is that an algorithm which moves definitions must update the various sets (use, iuse, def, idef) when code motion occurs. RTEB does not need updates. Since the algorithm is concerned only with relocating right-hand-side expressions, it does not move variable definitions. Thus, the def and idef sets do not change. So, while relocating right-hand-side expressions changes use and iuse, RTEB does not use or even compute this information. To move entire statements, however, dominator motion must update the various sets. Although statement use and definition sets do not change, dominator motion needs to update the four basic sets for a block (use, def, iuse, idef). When moving a statement, S, from block B to block D, a proper dominator of B, dominator motion potentially removes the use and def of S from B; it certainly adds the use and def of S to D; and it also potentially adds to the iuse and idef sets of any block, Q, of the CFG for which D is on some path between Q's immediate dominator and Q. Thus, to properly update the iuse and idef sets for the CFG, dominator motion needs to know, for each block, B, in the CFG, the set of blocks for which B is on some path between idom(Q) and Q. This set of blocks for which B contributes to the iuse and idef sets is the inverse dominator path(B). As we shall see in Section 3.4, dominator analysis computes this set as well. Using the inverse dominator path sets computed by dominator analysis, dominator motion can ensure that code motion is reflected in the iuse and idef sets of all blocks affected by the new location for the statement. We would like to also change the iuse and idef sets of blocks affected by the fact that the statement is no longer in its original location. Dominator motion does not, however, attempt to remove the def and use sets from the "appropriate" blocks' idef and iuse sets to reflect motion of the statement out of a block. Unfortunately, this omission is necessary to ensure correct program semantics, since we cannot guarantee that the statement moved was the only one which affected the idef and iuse sets of the appropriate blocks. (If bags were used instead of sets, this would not be a problem. Since a bag allows multiple copies of the same element, removing one element of the bag when a statement moves would provide the correct update: the idef and iuse sets would have the element due to the moved statement removed, but any other contributor to a variable's inclusion in the bag would be left behind. The bag data type cannot be represented as efficiently as a set, however, so for reasons of efficiency it is not supported in ROCKET.) One possible remedy for this problem would be to perform the entire code motion exercise repeatedly, since it is fast. Yet another consequence of moving intermediate statements instead of expressions is that the amount of code motion depends on prior motion. Consider the following group of intermediate statements:

x = y * 4
z = x + y
w = z - x


RTEB could determine some birthplace for the expression

y * 4

which would be whatever the origin of y is for the basic block being investigated. We could not move either of the other two expressions, since they depend upon values defined in this block. When performing motion of entire intermediate statements, however, moving the entire statement

x = y * 4

to another block makes it (potentially) possible to move further statements, specifically those using the value of x just defined. Because of this, to obtain maximum code motion, dominator analysis cannot compute birthpoints of statements once for an entire program, as RTEB does. That approach works well only when we move expressions; it severely limits motion possibilities when moving definitions. Instead, to find the maximum (or even correct) code motion, dominator analysis computes the original sets of use, def, iuse, and idef for each basic block, and dominator motion combines moving statements with updating of the basic block sets. All of these facets are combined into a single algorithm, MoveStatement, which performs dominator motion on a single intermediate statement. The algorithm is shown in Figure 3.3.
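The cascading effect just described can be pictured by placing the three statements of the example in a loop (an invented setting; cascade_before and cascade_after are hypothetical names): once x = y * 4 moves to a dominating block, z = x + y becomes movable, and then w = z - x.

    /* Original: all three statements sit inside the loop body. */
    void cascade_before(int y, int n, int *out) {
        for (int i = 0; i < n; i++) {
            int x = y * 4;
            int z = x + y;
            int w = z - x;
            out[i] = w;
        }
    }

    /* After repeated application of dominator motion: moving the definition
       of x first is what makes z movable, and moving z makes w movable. */
    void cascade_after(int y, int n, int *out) {
        int x = y * 4;
        int z = x + y;
        int w = z - x;
        for (int i = 0; i < n; i++)
            out[i] = w;
    }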

3.4 The Algorithms

This section presents the algorithms used to compute the information necessary for determining the set of blocks to which each intermediate statement can be moved. (Since all of these algorithms are modifications of the RTEB algorithm, I have attempted to follow RTEB notation as much as possible.) The algorithms themselves are divided into three subsections. Section 3.4.1 includes the basic RTEB algorithm as described by Reif and Tarjan. The dominator analysis algorithm is provided in Section 3.4.2. Finally, Section 3.4.3 provides listings for some ancillary algorithms used in the RTEB computation. Since these routines have been modified to add the extra information necessary for the dominator analysis algorithm of Section 3.4.2, they are given their own subsection. In both Sections 3.4.2 and 3.4.3, each line possessing an addition to the original RTEB algorithm has been prefixed with "++" to allow the reader to easily identify the additions.

3.4.1 RTEB Algorithm

The discussion makes use of the following arrays, which are defined for the RTEB algorithm:

vertex(i): The vertex whose depth-first order number is i.

succ(v): The set of vertices, w, for which (v, w) is a graph edge.

Algorithm MoveStatement(S)

Input:
    Function Control Flow Graph
    An intermediate statement, S, we wish to move
    The basic block, Home, in which S currently resides
    The following data flow information:
        Globals, the set of all global variables
        Def(S) and Use(S)
        For each basic block, B, in the control flow graph:
            ImmediateDominator(B)
            Pdef(B), which is Def(B) ∪ IDef(B)
            Puse(B), which is Use(B) ∪ IUse(B)
            UsedBeforeDefined(B)
            InverseDominatorPath(B)

Output:
    ListOfBlocks(S), the list of basic blocks to which S can be moved
        without changing the semantics of the program
    NewHome, the heuristically chosen block from ListOfBlocks(S)
        to which S is moved

Algorithm:
    /* Build the list of blocks */
    ListOfBlocks(S) = ∅
    TargetBlock = Home
    if Def(S) ∩ Globals != ∅ OR Def(S) ∩ UsedBeforeDefined(Home) != ∅
        quit (the statement cannot be safely moved)
    Danger = Pdef(TargetBlock) ∪ Puse(TargetBlock)
    while Def(S) ∩ Danger = ∅ AND Use(S) ∩ Pdef(TargetBlock) = ∅
        add TargetBlock to ListOfBlocks(S)
        TargetBlock = ImmediateDominator(TargetBlock)
        Danger = Pdef(TargetBlock) ∪ Puse(TargetBlock)
    Heuristically choose NewHome from among the elements of ListOfBlocks(S),
        and append S to the statements of NewHome

    /* Update necessary data flow information */
    Def(NewHome) = Def(NewHome) ∪ Def(S)
    Use(NewHome) = Use(NewHome) ∪ Use(S)
    for each block, B, in InverseDominatorPath(NewHome)
        Pdef(B) = Pdef(B) ∪ Def(S)
        Puse(B) = Puse(B) ∪ Use(S)

Figure 3.3: Dominator Motion for an Intermediate Statement


pred(w): The set of vertices, v, for which (v, w) is a graph edge.

parent(w): The vertex which is w's parent in the spanning forest.

dom(w): The immediate dominator of w.

semi(w): The semi-dominator of w. The semi-dominator array is, in some sense, a temporary used to store what "looks to be" the immediate dominator of a node. The actual immediate dominator will turn out to be either the semi-dominator itself or the immediate dominator of the semi-dominator.

bucket(w): The set of vertices whose semi-dominator is w.

def(w): The set of program variables defined in vertex w.

idef(w): The set of program variables defined on some path between the immediate dominator of w and w.

sdef(w): The semi-def set. Like the semi-dominator array, sdef represents the "first guess" at the idef set.

The RTEB algorithm itself is divided into four steps:

1. Initialization performs a depth-first search of the CFG, numbering the vertices from 1 to n as they are encountered during the search. In addition, the variables to be used in steps 2-4 receive initial values.

2. While making a pass over the CFG's nodes in reverse depth-first order, compute values for temporaries which will subsequently be needed for the dominator and idef values, in the variables semi and sdef.

3. Make "first guess" estimates of the dominator and idef values from the semi and sdef variables defined in step 2. Again, the CFG nodes are investigated in reverse depth-first order.

4. Make another pass over the CFG basic blocks in depth-first order, finalizing the dominator and idef values.

During its initial computation of the idom and idef values for each vertex of the CFG (in steps 2 and 3), the algorithm maintains (in addition to the CFG) a separate data structure representing a forest contained in the depth-first spanning tree of the CFG. This forest is built and accessed by three routines:

- Link(v,w): Add edge (v, w) to the forest.

- Eval(v): If v is the root of a tree in the forest, return v. Otherwise, let r be the root of the tree in which v resides. Return the vertex, u, such that semi(u) is minimum for all u on a path from r to v.


- Evaldef(v): If v is a tree root, return ∅. Otherwise, let r = v0 → v1 → ... → vk = v be the path from the root of the tree containing v to v. Return the union of (def ∪ idef) over all such vi.

The Link and Eval routines, described in [LT79], are relatively straightforward. Since they remain unchanged by my additions to the RTEB algorithm, they will not be discussed further here. In contrast, the additions to the RTEB algorithm described here require extensive modification of Evaldef, and so that algorithm is included as an ancillary algorithm in Section 3.4.3 (see Figure 3.7). Figure 3.4 shows the RTEB algorithm, annotated with the "step" numbers corresponding to the algorithm's four phases.

3.4.2 Dominator Analysis Algorithm

As discussed in Section 3.3, to enable movement of intermediate statements instead of just expressions, dominator analysis adds to RTEB's analysis information. Specifically, the first additional piece of information is the iuse set for each CFG block. This allows dominator motion to ensure that a variable definition does not move past a use of that same variable. In addition, since dominator motion needs to update the various data flow sets (use, def, iuse, idef) whenever it moves a statement from one basic block to another, dominator analysis computes, for each block, B, the set of all other blocks, X, for which definitions and uses in B can contribute to X's idef and iuse sets. But this is just that set of blocks, X, for which B lies on some path between the immediate dominator of X and X itself. I call this set of blocks the inverse dominator path set of B. Dominator analysis' enhancement of RTEB to compute iuse follows closely RTEB's computation of idef. Computation of the inverse dominator path sets is a little more complicated, but dominator analysis enhances RTEB to include a set, ipath, of all basic blocks between a node's immediate dominator and the node itself. A simple pass over the CFG then allows use of the ipath sets to determine the inverse dominator path sets. In addition to the global arrays already needed for RTEB (Figure 3.4), dominator analysis requires the following additional arrays:

iuse(w): The set of program variables used on some path between idom(w) and w.

suse(w): The semi-use set. Like the semi-def set, suse represents the "first guess" at the iuse set.

ipath(w): The set of basic blocks which are included on some path between idom(w) and w itself.

spath(w): The semi-path set. Like semi-use and semi-def, this is a "first guess" at the ipath set.

Algorithm RTEB()
{
STEP 1:
    InitializeDominators()

    for i := n by -1 until 2
        w := vertex(i)
STEP 2:
        for each v in pred(w) do
            u := Eval(v)
            if semi(u) < semi(w)
                semi(w) := semi(u)
            sdef(w) := sdef(w) ∪ Evaldef(v)
        end for each v
        add w to bucket(vertex(semi(w)))
        Link(parent(w), w)
STEP 3:
        for each v in bucket(parent(w))
            delete v from bucket(parent(w))
            u := Eval(v)
            if semi(u) < semi(v)
                dom(v) := u
            else
                dom(v) := parent(w)
            idef(v) := sdef(v) ∪ Evaldef(parent(v))
        end for each v
    end for i

STEP 4:
    for i := 2 until n do
        w := vertex(i)
        if dom(w) not equal vertex(semi(w))
            idef(w) := idef(dom(w)) ∪ idef(w)
            dom(w) := dom(dom(w))
    end for i
}

Figure 3.4: Reif and Tarjan's IDef Computation

In addition to the sets needed for the iuse and ipath computations, dominator analysis makes use of some additional global variables required to compute the proper function for Evaldef. As defined earlier, Evaldef(v), called in the computation of idef sets, returns the set of all variables defined in any node on a path from the root of v's spanning forest tree to v. To generalize this definition to include the similar information needed for updating iuse and ipath, we could either write new routines, Evaluse and Evalpath, or simply modify Evaldef to return multiple values through the use of global variables. The latter approach is taken here. Thus, dominator analysis needs the appropriate global variables, as well as two arrays to maintain the intermediate set information. The new variables are:

labdef(v): The set of program variables defined in any block included on the path from the root of v's spanning forest tree to v.

labuse(v): The set of program variables used in any block included on the path from the root of v's spanning forest tree to v.

euse: The global variable used to return the appropriate use set value from Evaldef.

intermediate_path_nodes: The global variable used to return the appropriate set of basic blocks from Evaldef.

The dominator analysis algorithm to compute dominators as well as idef, iuse, and ipath sets for each basic block of the CFG is provided in Figure 3.5.

3.4.3 Ancillary Algorithms

The RTEB algorithm calls four ancillary "routines" in its computation of dominators and idef sets. These are Link, Eval, InitializeDominators, and Evaldef. Lengauer and Tarjan [LT79] give two different algorithms for Link and Eval. Since Link and Eval remain unchanged in dominator analysis, they are not included here. Both InitializeDominators and Evaldef are substantially enhanced for dominator analysis and are thus shown in Figures 3.6 and 3.7. In addition, Evaldef (as well as Eval) calls yet another routine, Compress, which performs the path compression on the spanning forest data structure. An algorithm for Compress is also included in [LT79], but since Compress too is enhanced for dominator analysis, the revised version is included here, in Figure 3.8. Finally, to properly update the dominator analysis information when an intermediate statement is moved, dominator motion needs to know, for each basic block, B, of the CFG, the set of blocks, D, for which B lies on some path between D's immediate dominator and D. Since this is the inverse of the ipath set computed by dominator analysis, dominator analysis calls ComputeInverseDominatorPaths to build this set. The algorithm is given in Figure 3.9. In addition to the global arrays already needed for either the RTEB algorithm of Figure 3.4 or the dominator analysis algorithm of Figure 3.5, additional arrays are defined for InitializeDominators, Evaldef, and/or Compress:

Algorithm Dominator Analysis()
{
       InitializeDominators()
       for i := n by -1 until 2
           w := vertex(i)
           for each v in pred(w) do
               u := Eval(v)
               if semi(u) < semi(w)
                   semi(w) := semi(u)
               sdef(w) := sdef(w) ∪ Evaldef(v)
++             suse(w) := suse(w) ∪ euse
++             add v to inverse_dom_path(w)
++             spath(w) := spath(w) ∪ intermediate_path_nodes
++         labuse(w) := used(w) ∪ suse(w)
++         labdef(w) := defined(w) ∪ sdef(w)
++         inverse_dom_path(w) := inverse_dom_path(w) ∪ spath(w)
           add w to bucket(vertex(semi(w)))
           Link(parent(w), w)
           for each v in bucket(parent(w))
               delete v from bucket(parent(w))
               u := Eval(v)
               if semi(u) < semi(v)
                   dom(v) := u
               else
                   dom(v) := parent(w)
               idef(v) := sdef(v) ∪ Evaldef(parent(v))
++             iuse(v) := suse(v) ∪ euse
++             ipath(v) := spath(v) ∪ intermediate_path_nodes
       for i := 2 until n do
           w := vertex(i)
           if dom(w) not equal vertex(semi(w))
               idef(w) := idef(dom(w)) ∪ idef(w)
               dom(w) := dom(dom(w))
++             iuse(w) := iuse(w) ∪ iuse(dom(w))
++             ipath(w) := ipath(w) ∪ ipath(dom(w))
++     ComputeInverseDominatorPaths()
}

Figure 3.5: Dominator Analysis

Algorithm InitializeDominators()
{
       for i := 1 until n
           w := vertex(i)
           dom(w) = ancestor(w) = 0
           bucket(w) = sdef(w) = idef(w) = ∅
           label(w) = semi(w) = w
           labdef(w) = ∅
++         iuse(w) = suse(w) = labuse(w) = ∅
++         ipath(w) = spath(w) = inverse_dom_path(w) = ∅
       end for i

       /* Determine if w is the target of a back edge in the graph.
          If so, set sdef(w) = def(w) */
       for each v in pred(w)
           if semi(v) > semi(w) AND semi(v)