Modulo Scheduling, Machine Representations, and Register-Sensitive Algorithms

by

Alexandre Edouard Eichenberger

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Science and Engineering) in The University of Michigan, 1997

Doctoral Committee:
Dr. Santosh G. Abraham, Co-Chair
Professor Edward S. Davidson, Co-Chair
Professor John R. Birge
Professor John P. Hayes
Professor Trevor N. Mudge
Professor Yale N. Patt

ABSTRACT

Modulo Scheduling, Machine Representations, and Register-Sensitive Algorithms

by Alexandre Edouard Eichenberger

Co-Chairs: Santosh G. Abraham and Edward S. Davidson

High performance compilers increasingly rely on accurate modeling of the machine resources to efficiently exploit the instruction level parallelism of an application. In this dissertation, we first propose a reduced machine description that results in significantly faster detection of resource contentions while preserving the scheduling constraints present in the original machine description. This approach reduces a machine description in an automated, error-free, and efficient fashion. Moreover, it fully supports the elaborate scheduling techniques that are used by high-performance compilers, such as scheduling an operation earlier than others that are already scheduled, unscheduling operations due to resource conflicts, and efficient handling of the periodic resource requirements found in software pipelined schedules. Reduced machine descriptions resulted in processing the queries to the resource model in 58.1% of the original time for a benchmark of 1327 loops scheduled by a state-of-the-art modulo scheduler for the Cydra 5 machine.

Scheduling techniques such as Modulo Scheduling are also increasingly successful at efficiently exploiting the instruction level parallelism up to the resource limit of the machine, resulting in high performance but increased register requirements.

In this dissertation, we propose an optimal register-sensitive algorithm for modulo scheduling that schedules loop operations to achieve the minimum register requirements for the smallest possible initiation interval between successive iterations. The proposed approach supports machines with finite resources and complex reservation tables, such as the Cydra 5. It also uses a well-structured formulation that enables us to find schedules within a reasonable time for more loops (920 of the 1327 loops) and larger loops (up to 41 operations).

We also propose an alternative approach, stage scheduling, which reduces the register requirements of a given modulo schedule by shifting its operations by multiples of II cycles. In addition to achieving a significant fraction of the possible decrease in register requirements (e.g., the average register requirements decrease by 22.2% for the optimal modulo scheduler, by 19.9% for the optimal stage scheduler, and by 19.8% for our best stage scheduling heuristic, compared to a register-insensitive modulo scheduler in a benchmark of 920 loops), the stage scheduling approach also preserves the steady-state performance of the initial schedules. By bounding the schedule length of each schedule, we may preserve the transient performance of the original as well. Thus, by coupling efficient stage scheduling heuristics with a register-insensitive high-performance modulo scheduler, we may very quickly obtain high-performance schedules with low register requirements.

A technique that further decreases the register requirements in predicated code is also presented. This technique is based on precisely computing the interferences among virtual registers in the presence of predicated operations.

© Alexandre Edouard Eichenberger 1997
All Rights Reserved

To my parents, Monique and Gerard.


ACKNOWLEDGEMENTS

I would like to thank Edward Davidson for his guidance through my graduate studies. His enthusiasm and optimism have made these years both productive and enjoyable. Above all, his search for the right wording will remain a great source of inspiration to me. I still recall the morning of a submission deadline when I found, at 8am on my desk, a sheet of paper labeled "a better proof for Theorem 2." I would also like to thank Santosh Abraham for his invaluable help during my first years in Ann Arbor. He has been a constant source of suggestions and encouragement. Next, I would like to extend my gratitude to the members of my committee, John Birge, John Hayes, Trevor Mudge, and Yale N. Patt, for their time and insightful guidance. B. Ramakrishna Rau and Mike Schlansker of Hewlett Packard Laboratories have assisted me on numerous occasions, both through professional discussions and research support. Several of the ideas pursued in this dissertation originated from fruitful discussions with them. They have also provided me with the benchmark of loops that is used throughout the dissertation to validate our results. My work has greatly benefited from interactions with Sadun Anik, Richard Johnson, Vinod Kathail, and Scott Mahlke. Finally, I would like to extend special thanks to the friends and colleagues who have made this time in Ann Arbor so enjoyable, especially Regina Kistner, Waleed Meleis, Peder Olsen, Jeff Von Arx, Karen Tomko, Michael Golden, JD Wellman, Jude Rivers, Stevan Vlaovic, Eric Boyd, Tien-Pao Shih, Shih-Hao Hung, Eric Hao, Rabin Sugumar, and Gheith Abandah.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTERS

1 Introduction
  1.1 Research Contributions
  1.2 Dissertation Overview

2 Efficient Representations of Microarchitecture Resources
  2.1 Related Work
  2.2 Reducing a Machine Description
  2.3 Building Generating Sets of Maximal Resources
  2.4 Selecting Synthesized Resources and Resource Usages
  2.5 Reduced Machine Examples
  2.6 Contention Query Module
  2.7 Performance of the Contention Query Module
  2.8 Summary

3 Overview of Modulo Scheduling
  3.1 Modulo Scheduling Approach
  3.2 Register Requirements of a Modulo Schedule
  3.3 Scheduling Complexity
  3.4 Algorithmic Approach
  3.5 Dependence Graph and Transformations
  3.6 Related Work

4 Stage Scheduling Algorithms
  4.1 Related Work
  4.2 Stage Scheduling and Register Requirements
  4.3 Optimal Algorithm for Minimum MaxLive
    4.3.1 Scheduling for Minimum Integral MaxLive
    4.3.2 Scheduling for Minimum MaxLive
    4.3.3 Determining Forcing Combinations
  4.4 Stage Scheduling Heuristics
    4.4.1 Input to Stage Scheduling Heuristics
    4.4.2 Stage Scheduling Transforms
    4.4.3 Acyclic Heuristic (AC)
    4.4.4 Sink/Source Heuristic (SS)
    4.4.5 Up Heuristic (UP)
    4.4.6 Register Sink/Source Heuristic (RSS)
    4.4.7 Combining Heuristics
  4.5 Schedule-Length Sensitive Stage Scheduling
  4.6 Measurements
    4.6.1 Schedules with Unlimited Schedule Length
    4.6.2 Schedules with Bounded Schedule Length
    4.6.3 Schedules with Increased Initiation Interval
  4.7 Summary

5 Modulo Scheduling Algorithms
  5.1 Related Work
  5.2 Modulo Scheduling and Register Requirements
  5.3 Scheduling for Minimum MaxLive
    5.3.1 Assignment Constraints
    5.3.2 Mapping-Free Resource Constraints
    5.3.3 Simple Formulation of the Scheduling Constraints
    5.3.4 Efficient Formulation of the Scheduling Constraints
    5.3.5 Average Lifetime Requirements
    5.3.6 Buffer Requirements
    5.3.7 Register Requirements
  5.4 Optimum Modulo Schedules
  5.5 Bounding the Schedule Space
  5.6 Measurements
    5.6.1 Evaluation of the Integer Linear Programming Formulation
    5.6.2 Maximum Throughput Measurements
    5.6.3 Register Requirements Measurements
  5.7 Summary

6 Register Allocation for Predicated Code
  6.1 Predication and Live Ranges
    6.1.1 Predicate Extraction
    6.1.2 Live Range Analysis
    6.1.3 Interference
  6.2 Post-scheduling Bundling
    6.2.1 Ordering Heuristics
    6.2.2 Selection Heuristics
  6.3 Pre-scheduling Bundling
  6.4 Measurements
  6.5 Summary

7 Conclusions
  7.1 Summary of Contributions
  7.2 Future Research

APPENDIX

LIST OF TABLES

2.1  Results for the Cydra 5.
2.2  Results for a subset of the Cydra 5.
2.3  Results for the DEC Alpha 21064.
2.4  Results for the MIPS R3000/R3010.
2.5  Characteristics of the 1327-loop benchmark suite when using the Iterative Modulo Scheduler.
2.6  Performance of the query functions when using the Iterative Modulo Scheduler (average work units per call).
2.7  Performance of the query functions when using the Iterative Modulo Scheduler (average microseconds per call).
2.8  Characteristics of the 1327-loop benchmark suite using the Modified Iterative Modulo Scheduler.
2.9  Performance of the query functions when using the Modified Iterative Modulo Scheduler (average work units per call).
2.10 Performance of the query functions when using the Modified Iterative Modulo Scheduler (average microseconds per call).
3.1  Example target machine (hypothetical processor).
3.2  Number of MRTs for 19 operations with general-purpose functional units.
3.3  Number of MRTs for 19 operations with specialized functional units.
4.1  Variables for the stage scheduling model.
4.2  Characteristics of the 1327-loop benchmark suite for schedules with unlimited schedule length.
4.3  Characteristics of the stage scheduling algorithms with unlimited schedule length.
4.4  Run time for the stage scheduling algorithms on a Sun Sparc-20 workstation.
4.5  Characteristics of the 1327-loop benchmark for schedules with bounded schedule length.
4.6  Characteristics of the stage scheduling algorithms with bounded schedule length.
4.7  Average II for machines with finite register files.
4.8  Average execution time ratio for machines with finite register files.
5.1  Variables for the modulo scheduling model.
5.2  Numbers of variables and constraints for 4 distinct formulations.
5.3  Characteristics of the modulo scheduling algorithms (using the structured scheduling constraints).
5.4  Characteristics of the 920-loop benchmark suite.
6.1  Extracting P-facts.
6.2  Characteristics of the 21-loop benchmark suite.
6.3  Normalized decrease in register requirements for the selection heuristic.
6.4  Normalized decrease in register requirements for the ordering heuristics.

LIST OF FIGURES

2.1  Reducing a machine description.
2.2  Three situations when adding an elementary pair to a resource.
2.3  Building the generating set for our example machine.
2.4  Selection heuristics for our example machine.
2.5  Reservation tables for a subset of the Cydra 5.
2.6  Reservation tables for the DEC Alpha 21064.
2.7  Reservation tables for the MIPS R3000/R3010 (part 1).
2.8  Reservation tables for the MIPS R3000/R3010 (part 2).
2.9  Performance of the contention query module when using the Iterative Modulo Scheduler.
2.10 Performance of the contention query module, including the Iterative Modulo Scheduler.
2.11 Performance of the contention query module when using the Modified Iterative Modulo Scheduler.
2.12 Performance of the contention query module, including the Modified Iterative Modulo Scheduler.
3.1  Key concepts and terminology of modulo scheduling (Example 3.1).
3.2  Register requirement of a modulo schedule (Example 3.1).
3.3  Register vs. buffer requirements (Example 3.1).
3.4  Kernel 7 of the Livermore Fortran Kernels.
3.5  Modulo schedule for Example 3.2.
3.6  Modulo schedule for Example 3.3.
3.7  Removing edges.
4.1  MRT and stage schedules with load-y scheduled early (Example 4.1).
4.2  MRT and stage schedules with load-y scheduled late (Example 4.1).
4.3  MRT and stage schedules with add scheduled early (Example 4.2).
4.4  MRT and stage schedules with add scheduled late (Example 4.2).
4.5  Computing the underlying cycle constraint (Example 4.3).
4.6  Skip and delta factors for stage scheduling (Example 4.4).
4.7  Stage scheduling model for minimum integral MaxLive.
4.8  Fractional part of the lifetimes (Example 4.4).
4.9  Fractional parts of the lifetimes (Example 4.5).
4.10 Initial stage schedule with additional skip factors (Example 4.5).
4.11 Up-propagation transform.
4.12 UP propagation heuristic.
4.13 Register source heuristic example.
4.14 Improved stage schedule (Example 4.5).
4.15 Stage scheduling and schedule length (Example 4.6).
4.16 Register requirements in the 1327-loop benchmark suite.
4.17 Additional register requirements relative to the MinReg Stage-Scheduler.
4.18 Additional register requirements of stage scheduling heuristics.
4.19 Schedule length of one loop iteration in the 1327-loop benchmark suite.
4.20 Additional register requirements (unlimited schedule length).
4.21 Additional register requirements (bounded schedule length).
4.22 Increasing II to reduce the register requirements using the Iterative Modulo Scheduler.
4.23 Increasing II to reduce the register requirements using the MinReg Stage-Scheduler.
4.24 Impact of increasing II on the schedule length.
5.1  Modulo schedule with add scheduled early (Example 5.1).
5.2  Modulo schedule with add scheduled late (Example 5.1).
5.3  Determining the fractional and integral part of a lifetime.
5.4  Modulo scheduling for minimum MaxLive.
5.5  Average number of branch-and-bound nodes per loop visited by the CPLEX solver in the 711-loop benchmark.
5.6  Average number of simplex iterations per loop performed by the CPLEX solver in the 711-loop benchmark.
5.7  Register requirements in the 920-loop benchmark.
5.8  Additional register requirements relative to the MinReg Modulo-Scheduler.
5.9  Additional register requirements of the MinReg Stage-Scheduler relative to the MinReg Modulo-Scheduler in the 920-loop benchmark, broken down by loops in each II range.
5.10 Detail of the additional register requirements of the MinReg Stage-Scheduler (blown-up region of Figure 5.9).
5.11 Additional register requirements relative to the MinReg Modulo-Scheduler in the 752-loop benchmark.
6.1  Register requirements of a conditional statement (Example 6.1).
6.2  Register requirements of an IF-converted conditional statement (Example 6.1).
6.3  Reconstructing a dataflow graph.
6.4  Flow and Def component of a live range.
6.5  Determining interferences.
6.6  Interference algorithm.
6.7  Register requirements for the 21-loop benchmark suite.
6.8  Additional register requirements of the pre- and post-scheduling bundling heuristics relative to the Predicate-Sensitive Lower Bound.
A.1  Dependence graph and its spanning tree (Example 4.4).
A.2  Formulation with underlying cycle constraints (Example 4.4).
A.3  Formulation with variables p_{0,2} and p_{0,5} eliminated; remaining p_{i,j} variables correspond to the edges of the spanning tree of Figure A.1b.
A.4  Incidence matrix for dependence graph of Figure A.1 (Example 4.4).
A.5  Network matrix for spanning tree of Figure A.1.
A.6  System of inequalities corresponding to Figure A.3.

CHAPTER 1

Introduction

High performance for VLIW and superscalar processors is achieved by exposing the inherent parallelism in an application and capitalizing on the resulting instruction level parallelism to achieve an efficient instruction schedule. Three major components are critical to obtaining efficient, high-performance code at compile time. First, high performance compilers need to work with an accurate model of the available machine resources so that they can schedule the code in a way that reduces or avoids the resource contentions that may stall some of the pipelines. Second, these compilers must use sophisticated algorithms to extract the instruction level parallelism available in the application and efficiently utilize that parallelism up to the resource limit of the machine. Third, for efficiency, the resources provided by the processor configuration should be well balanced with respect to the resource requirements of the applications. This dissertation focuses primarily on enhancing the first two components.

To efficiently utilize the machine resources of a microprocessor, high performance compilers rely on precisely detailed machine models that account for the resources used by the operations of a schedule [19][27][42][45][59][78]. Precise modeling of machine resources is critical to avoid resource contentions that may stall some of the pipelines or, in the absence of hardware interlocks, corrupt some of the results. Efficiently modeling the machine resources is important since high performance compilers spend a significant amount of compilation time scheduling operations, and thus testing for resource contentions.

We address this issue by proposing a reduced machine description that results in significantly faster detection of resource contentions while exactly preserving the scheduling constraints present in the original machine description. We demonstrate how to derive a reduced machine description and present several examples illustrating the effectiveness of our approach. The main contribution of our approach, compared to previous work [9][68][76], is that the machine resources are efficiently modeled without restricting the functionality of the scheduling algorithms. In particular, our approach effectively supports schedulers that achieve high performance by using a backtracking mechanism that reverses a limited number of previous scheduling decisions [59], by using software pipelining to overlap the execution of consecutive loop iterations [48][55][57][81], or a combination of both [27][49][78][83]. Thus, by using our reduced machine description, the computational requirements of these scheduling algorithms, and other similar algorithms, should decrease as they would no longer be penalized by less efficient machine resource models.

To improve the performance of an application and benefit from the increasingly large number of available resources found in high-performance processors, compilers rely on exposing sufficient parallelism from the application. There is generally insufficient instruction level parallelism within individual basic blocks, but higher levels of parallelism can be obtained by exploiting the instruction level parallelism among successive basic blocks. In straight line code, for example, trace scheduling is used to merge several basic blocks along a frequent execution path into a single enlarged block [42][50]. These blocks can be further enlarged by using predicated execution [48] to merge several execution traces [60][86]. Similarly, in loop code, software pipelining extends the scope of compilation beyond one basic block by overlapping the execution of consecutive loop iterations [27][48][49][78][94]. Predicated execution can also be used within loop code to merge blocks across conditional statements in software pipelined loops as well [27][78][87].

When sufficient instruction level parallelism has been exposed, the compiler can then successfully hide the latency of the operations by efficiently scheduling the code while enforcing the dependence constraints among operations and the resource constraints of the machine.

While machine resources such as stages of functional units, register ports, and internal buses are traditionally expressed using reservation tables, and thus can be efficiently modeled using the reduced machine descriptions proposed here, other limited machine resources, such as the registers, are more difficult to model within this framework. This problem is further aggravated by the fact that schedules exhibiting high concurrency generally result in higher register requirements as well; consequently, a significant body of research has sought scheduling algorithms that result in high performance and low register requirements [14][43][49][58][75]. This problem is crucial to the performance of future machines as higher levels of parallelism inherently exacerbate the register requirements [64]. In this dissertation, we address the issue of register-sensitive schedulers in the context of software pipelined loops.

First, our research aims to understand the fundamental relation between instruction level parallelism and register requirements for a set of loop benchmarks under realistic assumptions. To that effect, we investigate an optimal modulo scheduling algorithm that searches for a loop schedule with the highest steady-state throughput over all modulo schedules, and the minimal register requirements among these maximum throughput schedules. Our work extends previous research [6][44][63][70] in several directions: we precisely model the register requirements in each cycle of the schedule, we schedule for machines with finite resources and arbitrary reservation tables, and we explore the entire space of modulo schedules using a novel and better-structured formulation that finds optimum solutions more quickly. As a result, schedulers based on solving integer linear programs should be more applicable, since we have significantly decreased the computational complexity of these schedulers by using a better structured formulation, while fully capitalizing on the strength of optimal solvers, since the register requirements are characterized fully accurately.

We have also developed a set of heuristics that closely approximate the minimum solution with linear, or at most quadratic, computational complexity. This approach is based on a set of heuristics that reduce the register requirements of a software pipelined loop by rescheduling some of its operations.

Compared to prior work on lifetime-sensitive modulo scheduling heuristics [49][58], this approach searches for schedules with reduced register requirements while preserving the performance of the original schedule. This approach should be extremely beneficial to high-performance compilers since, by using it, compilers can first focus on searching for a schedule with the highest performance, and then focus on reducing the register requirements while preserving the previously achieved level of performance.

While performance is generally increased by using predicated execution [48] to exploit instruction level parallelism in the presence of conditional branches [27][60][78][94], code with predicated operations may result in artificially higher register requirements if the predicate under which registers are defined and used is not taken into account when allocating virtual registers to physical registers. To limit this negative impact on the register requirements, we address this issue by precisely computing the interferences among virtual registers in the presence of predicated operations.

1.1 Research Contributions

Efficient Representations of Microarchitecture Resources

To better utilize machine resources and reduce or prevent resource contentions, high performance compilers use precisely detailed microarchitecture representations [19][27][42][45][59][78]. Efficient representations are important since high performance compilers spend a significant amount of compilation time scheduling operations, and thus testing for potential resource contentions. When a benchmark suite of 1327 loops from the Perfect Club [13], SPEC-89 [91], and the Livermore Fortran Kernels [65] is scheduled for the Cydra 5 machine [11][27], approximately 50% of the total time is spent modeling the resources (i.e., answering queries such as "can this operation be scheduled in this cycle"); the other 50% of the total time is spent scheduling operations (i.e., deciding the order in which operations are scheduled, initiating the queries to the resource model, and enforcing the data dependences), when using a given machine description for the Cydra 5 and our implementation of the Iterative Modulo Scheduler [78].

We propose a reduced machine representation that is precise and efficient for a compiler to use in its scheduling algorithms. Our contribution is a technique [31] that reduces a machine representation, resulting in significantly faster detection of resource contentions while exactly preserving the scheduling constraints present in the original machine. The resulting representation is expressed using reservation tables that determine the usage of synthesized resources for each operation. The major advantage of this approach over previous work [9][68][76] is that no restrictions are imposed on scheduling algorithms other than the need to satisfy the constraints of the machine itself. Reduced representations for the DEC Alpha 21064 [28], MIPS R3000/R3010 [53], and Cydra 5 [11] indicate potentially 4.0 to 6.9 times faster contention queries, while requiring 22 to 67% of the memory storage used by the original machine descriptions. Dynamic measurements obtained when scheduling the 1327 loops of the benchmark suite for the Cydra 5 machine indicate that the queries to the resource model actually complete in 58.1% of the original time, on average, when using a highly reduced machine description rather than the original machine description.

Register-Sensitive Modulo Scheduling

Modulo scheduling is a software pipelining technique that achieves higher levels of instruction level parallelism by overlapping the execution of consecutive loop iterations [80]. Modulo scheduling uses the same schedule for each iteration of a loop and initiates successive iterations at a constant rate, i.e. one Initiation Interval (II clock cycles) apart. Modulo scheduling results in high performance code, but increased register requirements due to higher levels of concurrency [64]. Thus, developing register-sensitive modulo schedulers is particularly critical for high performance machines.
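The timing discipline of a modulo schedule is simple enough to state in a few lines. The sketch below is a generic illustration, not any particular scheduler's code; `sched` is a hypothetical map from each operation to its issue cycle in the first iteration.

```python
def issue_cycle(sched, op, n, II):
    """Cycle at which loop iteration n issues `op` under a modulo schedule."""
    return sched[op] + n * II      # successive iterations start II cycles apart

def mrt_slot(sched, op, II):
    """Row of the Modulo Reservation Table whose resources `op` occupies."""
    return sched[op] % II          # identical for every iteration of the loop
```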

Our first contribution to register-sensitive modulo schedulers is an optimal algorithm [36] that finds a schedule with minimum register requirements among all maximum throughput modulo schedules. This algorithm, referred to as the MinReg Modulo-Scheduler, is based on an integer linear programming formulation that can handle loop iterations with up to 41 operations when using no more than 15 minutes of computation time on an HP-9000/715 workstation. The MinReg Modulo-Scheduler extends previous work [7][44][70] by precisely modeling the register requirements, accurately modeling machines with finite resources and arbitrary reservation tables, and exploring the scheduling space with a more efficient (structured) formulation. In the 1327-loop benchmark suite for the Cydra 5 machine, for example, the MinReg Modulo-Scheduler finds schedules with lower register requirements for 13.0% of the 1327 loops, compared to minimizing the best approximation of the actual register requirements found in the literature [29][70]. Also, using the structured formulation [34] results in a decrease in the total execution time of the solver by a factor of 8.6, over all the loops successfully scheduled by both the structured and the traditional formulation of the modulo scheduling search space.

Our second contribution to register-sensitive modulo schedulers is an approach that reduces the register requirements of a given modulo schedule by rescheduling some of its operations. In this approach, referred to as stage scheduling, operations can be moved only by integer multiples of II cycles, where II is the initiation interval between consecutive iterations of a given modulo schedule. The major advantages of the stage scheduling approach are its reduced complexity relative to the full scheduling problem above and the fact that lower register requirements are achieved without impacting the steady-state performance of the original modulo schedule. Furthermore, by bounding the maximum schedule length of an iteration, e.g. forcing it not to exceed that of the original schedule, a significant decrease in register requirements can be achieved with no impact on the transient performance of the schedules as well. We contribute an optimal stage scheduling algorithm [35][37], referred to as the MinReg Stage-Scheduler, that finds a schedule with minimum register requirements among all stage schedules. We also contribute a set of stage scheduling heuristics [33] that closely approximate the minimum solution with linear, or at most quadratic, computational complexity.

Our results indicate that, for all the 920 loops successfully scheduled by the MinReg Modulo-Scheduler in no more than 15 minutes on an HP-9000/715 workstation, the register requirements are decreased by 22.2% on average by using this optimal modulo scheduler rather than a register-insensitive modulo scheduler. When using the optimal stage scheduler in conjunction with the register-insensitive Iterative Modulo Scheduler [78], the register requirements decrease by 19.9% on average. Moreover, the average register requirements decrease by 19.8% and 19.2% when our best stage scheduling heuristics with unlimited schedule length and bounded schedule length, respectively, are employed. Thus, the stage scheduling heuristics appear to provide almost all of the possible register reduction, as only a few percent further reduction is possible by using an algorithm that finds a modulo schedule with minimum register requirements among all modulo schedules with a given initiation interval.

As a result, the stage scheduling approach is well suited for high-performance compilers, since they may first focus solely on searching for a schedule with the highest performance and then, if needed, use a stage scheduling heuristic with bounded schedule length to significantly decrease the register requirements while preserving both the transient and steady-state performance of the original high-performance schedule.

Register Allocation for Predicated Code

High performance compilers use predicated execution [48] to increase the instruction level parallelism of an application by merging several basic blocks into an enlarged predicated block [27][60][78][94] with IF-conversion [4][72]. This generally results in higher performance code but increased register requirements, an effect that is partly due to current compilers not allocating registers as well for predicated code as for unpredicated code.

We investigate here a framework that reduces the increase in register requirements of predicated code by precisely computing the interferences among virtual registers in the presence of predicated operations. The framework is based on the logical relations among the predicate values defined and used in an enlarged predicated block.

Using this framework, we propose a technique [32] that reduces the register requirements by allowing non-interfering virtual registers whose lifetimes overlap in time to share a common virtual register. For 21 loops taken from the Perfect Club, SPEC, and Livermore benchmarks that have one or more pairs of disjoint predicate values, this technique reduces the register requirements by 7.1%.

1.2 Dissertation Overview

This dissertation is organized as follows. A technique that efficiently models the microarchitecture resources for the compiler is presented in Chapter 2. An overview of modulo scheduling and our approach to register-sensitive modulo scheduling is described in Chapter 3. Algorithms for stage scheduling and modulo scheduling are investigated in Chapters 4 and 5, respectively. A technique that reduces the register requirements for predicated code is shown in Chapter 6. Conclusions are presented in Chapter 7.


CHAPTER 2

Efficient Representations of Microarchitecture Resources

Current compilers for VLIW and superscalar machines focus on exploiting more of the inherent parallelism in an application in order to obtain higher performance. Fine grain schedulers are a critical element in efficiently exploiting instruction level parallelism, and a significant body of research has sought more effective scheduling algorithms. Several new directions have been explored: schedulers may not schedule operations in cycle order, focusing initially on operations along critical paths [27][30][43][49][77][95]; they may backtrack to reverse poor scheduling decisions [27][49][59][77][83][94]; and they may hide long latencies by speculating operations across branches and basic blocks [12][20][27][30][59][67].

High performance compilers have also used precisely detailed machine models [19][27][42][45][59][78] to better utilize the machine resources of current processors with increasingly wider issue mechanisms, deeper pipelines, and more heterogeneous functional units. Precise modeling of machine resources is critical to avoid resource contentions that may stall some of the pipelines or, in the absence of hardware interlocks, corrupt some of the results. Resource modeling has to cope with rapidly changing processor models while controlling development cost by reusing existing compiler technology.


To meet these challenges, compilers have increasingly relied on a resource modeling utility, separated from the rest of the compiler, that can quickly answer the following query: "Given a target machine and a partial schedule, can I place this additional operation in this cycle without resource contention?" Typically, this functionality has been provided by a contention query module that processes the machine description of a target machine, generates an internal representation of the resource requirements, and provides a querying mechanism [19][27][42][45][59]. The IMPACT compiler, for example, implemented such a module [45] to produce high performance schedules for a wide range of machines, from existing architectures such as X86, PA-RISC, and SPARC to research architectures such as PlayDoh [54].

With the recent emphasis on exploiting instruction level parallelism, compile time is increasingly spent in the contention query module, as several cycles of a schedule, possibly from several basic blocks [12][67], are queried per operation in order to achieve good schedules. Optimizing contention query modules therefore has a significant impact on the overall performance of a compiler, as these time-consuming queries are issued in the innermost loop of the scheduler. For example, the high performance scheduler used in our experiments spends approximately 50% of its time in the contention query module when a manually optimized machine description is used to schedule a benchmark of 1327 loops.
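To make the query concrete, here is a minimal hedged sketch of such a contention query module, assuming each operation's reservation table is encoded as per-resource bit masks (bit k set means the resource is used k cycles after issue, with cycles assumed non-negative). The class and method names are illustrative, not those of any compiler cited here.

```python
class ContentionQueryModule:
    """Answers "can this operation be placed in this cycle without contention?"."""

    def __init__(self, reservation_tables):
        self.tables = reservation_tables   # operation -> {resource: usage bit mask}
        self.reserved = {}                 # resource  -> bit mask of reserved cycles

    def conflicts(self, op, cycle):
        return any(self.reserved.get(res, 0) & (mask << cycle)
                   for res, mask in self.tables[op].items())

    def place(self, op, cycle):
        """Reserve the operation's resources; callers check conflicts() first."""
        for res, mask in self.tables[op].items():
            self.reserved[res] = self.reserved.get(res, 0) | (mask << cycle)
```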

Optimizing query handling has recently been addressed in several papers [9][68][76], resulting in greatly improved query response time, but also limiting to some extent the functionality of the contention query module. Examples of additional functionality not covered by previous research in optimized contention query modules are the unscheduling of operations due to resource contentions and the efficient handling of the periodic resource requirements found in software pipelined schedules.

In this chapter, we propose a reduced machine description that results in significantly faster detection of resource contentions while exactly preserving the scheduling constraints present in the original machine description and without restricting the functionality of schedulers. The reduced machine description is expressed using reservation tables that determine the resource usage for each operation. We demonstrate how to derive a reduced machine description for a given target machine and present several examples illustrating the effectiveness of our approach.

The proposed approach fully supports unrestricted scheduling models, where operations can be scheduled in arbitrary order and prior scheduling decisions can be reversed. Unrestricted scheduling is essential to accommodate the elaborate scheduling techniques used by today's high performance compilers. The Cydra 5 compiler [27], for example, uses an operation-driven scheduler that reduces the schedule length of a basic block by scheduling operations along the critical path first. Operation-driven schedulers consider operations in topological order, not in order of monotonically increasing (or decreasing) schedule time. Also, the Cydra 5 and IMPACT compilers, as well as others, use software pipelining techniques to achieve loop schedules with high throughput [27][67][94]. Software pipelining schedulers do not consider operations in topological order as, in general, no topological order is defined in dependence graphs with loop-carried dependences. Moreover, experimental results indicate that software pipelined loops can achieve higher throughput in less compilation time when some limited number of scheduling decisions can be reversed due to violated dependence or resource constraints, as shown by Rau [78]. Limited backtracking is used in numerous compilers, e.g. when scheduling software pipelined loops [27][49][78][83][94] and scalar code [59].

description more easily or further, conservative assumptions may be employed. Thus, the reduced machine description may prohibit certain operation sequences that cause no contentions on the target machine. Furthermore, high performance compilers are often developed in parallel with microarchitecture development during which resource requirements often change. Manually reducing the machine description must then be carried out several times, introducing more potential for errors, suboptimal solutions, and increased development and maintenance cost. Using our approach, the resource requirements can be expressed in terms close to the actual hardware structure of the target machine, and the reduced machine description used by the compiler is generated in an error-free and automated fashion. Experiments with the DEC Alpha 21064 [28], MIPS R3000/R3010 [53], and Cydra 5 [11] machines indicate potentially 4.0 to 6.9 times faster contention queries, while requiring 22 to 67% of the memory storage used by the original machine descriptions. Dynamic measurements obtained when scheduling 1327 loops from the Perfect Club [13], SPEC-89 [91], and the Livermore Fortran Kernels [65] for the Cydra 5 machine [11] indicate that the essential work performed by the contention queries decreases by a factor of 2.76 to 3.30, depending on the functionality required by the scheduler. This decrease in essential work results, in turn, in a 1.72 to 1.83 faster execution time of the contention queries. These improvements are obtained by using highly reduced machine descriptions instead of the original or manually optimized machine description. In this chapter, we present related work in Section 2.1 and an introductory example in Section 2.2. Algorithms to construct reduced machines are developed in Sections 2.3 and 2.4. Reduced machine examples are presented in Section 2.5. A contention query module is developed in Section 2.6 and its performance is investigated in Section 2.7. We present our conclusions in Section 2.8.

12

2.1 Related Work Resource contention in multipipeline scheduling may be based directly on reservation tables, or on the forbidden latency sets or contention-recognizing state machines derived from them, as introduced by Davidson et al [25]. Traditionally, reservation tables contain much redundant information that consumes memory and increases query response time. As a result, recent advances favor nite-state automaton approaches. In this chapter, however, we propose a reduced reservation table approach that eliminates much of the redundancy and does not su er from the limitations of the automaton approaches, as detailed below. Proebsting and Fraser [76] as well as Muller [68] proposed a contention query module using a nite-state automaton that recognizes all contention-free schedules. The technique proposed by Proebsting and Fraser directly results in minimal nitestate automata [76]. This approach was recently extended for unrestricted scheduling models by Bala and Rubin using a forward and reverse pair of automata [9]. In their approach, operations considered in order of monotonically increasing (or decreasing) schedule time are quickly scheduled using a forward automaton. Additional operations are then inserted in the schedule in cycles recognized as contention-free by the forward and reverse automata. Because an inserted operation introduces additional resource requirements, these additional requirements propagate in adjacent cycles, i.e. the state in adjacent cycles must be updated in both the forward and reverse automata to keep the state machines consistent. Their approach also addresses the handling of basic block boundary conditions at the cost of introducing new states; recent experimental evidence gathered by Bala and Rubin shows, however, that no additional states are introduced for the Alpha, PA RISC, and MIPS families if minimal nite-state automata are constructed [8]. The principal advantage of automaton-based approaches is that a single table lookup can determine the next contention-free cycle. A potential problem of this approach is the size of these automata, an issue that is addressed in the literature in the following two ways. First, operations of a target machine can be combined 13

into classes of operations that have compatible resource contentions [76]. Second, large automata can be factored into sets of smaller ones [9][68]. However, although factoring reduces the size of the automata, it increases the number of table lookups necessary to process a contention query. Another potential problem arises when supporting unrestricted scheduling models, since the state of the forward and reverse automata must be saved for each scheduled operation, which may result in a large memory overhead, especially for wide-issue machines. Supporting unrestricted scheduling models also requires the consistency of the stored state to be maintained when scheduling additional operations [9], as inserted operations introduce additional resource requirements. Thus, handling unrestricted scheduling models introduces both memory and computation overhead that is similar to, or may exceed, the overhead incurred by the reservation table approach. More importantly, some scheduling algorithms require additional functionality from the contention query module that is not currently provided by modules based on the nite state automaton approach. For example, some backtracking schedulers may schedule an operation even though it may result in resource contentions, in which case the earlier scheduled operations that con ict are then unscheduled. This mechanism is a key component of the scheduling algorithm proposed by Rau to generate high performance software-pipelined loop schedules at a low level of computational complexity [78]. In the experiments presented in Section 2.7, this mechanism enables the scheduling algorithm to nd schedules with maximal throughput in 95.5% of the loops, versus 93.8% without it. Using reservation tables, this mechanism can be easily implemented by keeping a mapping from each reserved resource to the scheduled operation that consumes it. Providing this mechanism for the nite-state automaton approach corresponds to modifying the paths in the forward and backward automata so that the new operation is accepted by the automata in the desired cycle, by unscheduling the operations that were included in the original path but are not present in the new path. It is unclear at this point how such a mechanism would be eciently implemented in practice. Another example of functionality that is not currently provided by modules based 14

on the nite state automaton approach is the handling of resource requirements in software pipelined schedules that are traditionally represented with Modulo Reservation Tables. While it is clearly possible to build one automaton per period, it is unknown if a more economical approach is possible.

2.2 Reducing a Machine Description In this section, we illustrate the three-step process of constructing a synthesized machine, resulting in reduced numbers of resources and resource usages while exactly preserving the scheduling constraints due to resource contentions in the target machine. The machines handled here may have arbitrary resource requirements, including alternative resource usages, but must have latencies that are known at compile time. Alternative resource usages, e.g. operation X using either resource 0 or 1 interchangeably, require some preprocessing. In this case, we would replace operation X in the original machine description with two new operations, X 0 and X 1, identical to X but with operation X 0 exclusively using resource 0 and operation X 1 exclusively using resource 1. In general, new operations are created until all alternative resource usages present in the original machine description are removed. The operations generated from a unique operation in the original description are subsequently referred to as alternative operations, e.g. X 0 and X 1 are the alternative operations of X . We begin with a given machine description consisting of a set of reservation tables, one per operation, that expresses the resource requirements of each operation in terms close to the actual hardware structure of the target machine. The rows of a reservation table correspond to distinct resources of the target machine and its columns correspond to the cycles in which resources are used relative to the issue time of the corresponding operation. An X entry in row i/column j is made in the reservation table associated with operation X if there is a usage of resource i in cycle j by operation X , i.e. if resource i is reserved for exclusive use during cycle j by operation X . 15

[Figure 2.1: Reducing a machine description. Panel (a) shows the machine description as reservation tables, with usage sets A_0 = {0}, A_1 = {1}, A_2 = {2} for operation A, and B_1 = {0}, B_2 = {1}, B_3 = {2,3,4,5}, B_4 = {6,7} for operation B. Panel (b) shows the forbidden latency set matrix: F_{A,A} = {0}, F_{A,B} = {−1}, F_{B,A} = {1}, and F_{B,B} = {−3,−2,−1,0,1,2,3}. Panel (c) shows the generating set of maximal resources 0′ and 1′. Panel (d) shows the reduced machine description, with usage sets A_{0′} = {1}, B_{0′} = {0}, A_{1′} = ∅, and B_{1′} = {0,1,3}.]

Figure 2.1a shows the reservation tables of a hypothetical machine with 2 operations (A and B) and 5 distinct resources (0, ..., 4). Operation A is representative of the resource requirements of a fully pipelined functional unit. Operation B is representative of the resource requirements of a partially pipelined functional unit, where resource 3 may correspond to a multiply stage used for 4 consecutive cycles and resource 4 may correspond to a rounding-mode stage used for 2 consecutive cycles. Although this hypothetical machine was constructed to concisely illustrate our methodology, it is representative of some of the resource usage patterns found in our benchmark examples (see Figure 2.5 for example).
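To keep the running example concrete in the sketches that follow, the usage sets of Figure 2.1a can be written down directly. The dictionary layout is just one convenient encoding, not the dissertation's implementation.

```python
# Usage sets of the example machine (Figure 2.1a): operation -> resource -> cycles.
# Resources with empty usage sets (A_3, A_4, and B_0) are simply omitted.
USAGE_SETS = {
    "A": {0: {0}, 1: {1}, 2: {2}},                       # fully pipelined unit
    "B": {1: {0}, 2: {1}, 3: {2, 3, 4, 5}, 4: {6, 7}},   # partially pipelined unit
}
```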

Step 1. For each pair of operations, we extract from the corresponding pair of reservation tables of the target machine the set of forbidden latencies, i.e. the set of initiation intervals for which a resource contention occurs between the two operations.

Visually, the set of forbidden latencies between operations X and Y is obtained by overlapping their reservation tables and searching for all initiation intervals (offsets between the overlapped tables) that result in simultaneous use of one or more shared resources. To formalize the definition of forbidden latencies, we define the usage set X_i as the set of cycles in which operation X reserves resource i for exclusive use. Figure 2.1a illustrates the usage sets of our example machine. For example, usage set B_3 is equal to {2, 3, 4, 5}, as operation B uses resource 3 in cycles 2, 3, 4, and 5. Two operations X and Y, scheduled at times t_X and t_Y respectively, conflict if and only if there is some resource i and elements x ∈ X_i and y ∈ Y_i such that t_X + x = t_Y + y, i.e. both operations use resource i simultaneously. When such a conflict occurs, operation X cannot be scheduled (y − x) cycles after operation Y. We thus obtain F_{X,Y} = { f | operation X cannot be scheduled f cycles after operation Y }, i.e.

F_{X,Y} = { (y − x) | for all i ∈ Q, x ∈ X_i, y ∈ Y_i }        (2.1)

Equation (2.1) defines a matrix of forbidden latency sets for all pairs of operations, where F_{X,Y} is the set in row X, column Y of the matrix. Figure 2.1b illustrates this matrix computed for our example machine. While these sets are computed for each operation of the target machine, we need to list these sets only for each operation class, as presented by Proebsting and Fraser [76]. In general, two operations belong to the same operation class if they have the same sets of forbidden latencies, i.e. operations X and Y belong to the same class if F_{X,Z} = F_{Y,Z} and F_{Z,X} = F_{Z,Y} for each operation Z of the target machine. Note two properties of the forbidden latency matrix. First, operation X necessarily conflicts with itself at an initiation interval of 0 (i.e. 0 ∈ F_{X,X}) if it uses any resources. Second, operation X cannot be scheduled f cycles after operation Y if and only if Y cannot be scheduled f cycles before X (i.e. f ∈ F_{X,Y} ⇔ −f ∈ F_{Y,X}).
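The computation behind Equation (2.1) is mechanical. The following sketch, building on the USAGE_SETS dictionary above (the function name is illustrative), derives the full forbidden latency matrix and checks it against Figure 2.1b.

```python
from itertools import product

def forbidden_latencies(usage_x, usage_y):
    """F_{X,Y} of Equation (2.1): X cannot issue f cycles after Y whenever some
    resource i used by both has x in X_i and y in Y_i with f = y - x."""
    return {y - x
            for i in set(usage_x) & set(usage_y)
            for x, y in product(usage_x[i], usage_y[i])}

# Matrix of forbidden latency sets for every ordered pair of operations.
F = {X: {Y: forbidden_latencies(USAGE_SETS[X], USAGE_SETS[Y]) for Y in USAGE_SETS}
     for X in USAGE_SETS}

assert F["A"]["A"] == {0} and F["A"]["B"] == {-1}      # matches Figure 2.1b
assert F["B"]["A"] == {1} and F["B"]["B"] == set(range(-3, 4))
assert all(-f in F["B"]["A"] for f in F["A"]["B"])     # the second property above
```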


Formal Problem Definition. Generate a reduced machine description, i.e. one with a reduced number of resources and resource usages in its reservation tables, which for each operation pair produces exactly the same forbidden latencies as the target machine. One of several objective functions (e.g. the number of resources or resource usages) may be minimized, depending on the desired internal representation. Querying for resource contentions using either the original or the reduced machine description yields the same answer, as both descriptions enforce the same forbidden latencies. Note that to schedule a given machine, we need to know only whether, not where, resource conflicts occur. Thus we are free to represent the machine with any set of synthetic resources and usages that preserves the forbidden latencies.

Step 2. We build the generating set of maximal resources, which is defined as a set of resources that contains all maximal resources associated with the target machine [73]. A maximal resource is defined as a synthesized resource such that (a) every forbidden latency generated by this resource is forbidden in the target machine and (b) no additional usage by any operation can be added to this resource without generating a forbidden latency that is not forbidden in the target machine. Finally, since shifting all usages in a resource by some constant in time has no effect on the forbidden latencies, we only consider maximal resources that have their earliest usage in cycle 0.

From the construction in Theorem 2.1 below, it follows that there are only two maximal resources in our example machine, as shown in Figure 2.1c. The first resource, resource 0′, is a maximal resource that generates 1 ∈ F_{B,A}; it also includes the forbidden latencies 0 ∈ F_{A,A}, 0 ∈ F_{B,B}, and −1 ∈ F_{A,B}. Note that no other usages of A or B can be added to resource 0′, as they would necessarily introduce forbidden latencies not present in the forbidden latency matrix of our example machine. Similarly, the second maximal resource, resource 1′, generates F_{B,B}, which includes all the remaining forbidden latencies, and no other usages can be added. Note that the forbidden latency sets of the maximal resources need not be disjoint, e.g. 0 ∈ F_{B,B} is generated by both maximal rows here.

The maximal resources are interesting because any reservation table that generates the same forbidden latency matrix can be constructed from subsets of maximal resources, where the selected usage set in each resource may be translated by some number of cycles. As a result, we can use a (possibly empty) subset of the usages of each maximal resource to cover all the forbidden latencies of a target machine with the fewest number of (nonempty) synthesized resources.

Step 3. We select a subset of the maximal resources and their resource usages which covers all the forbidden latencies in the forbidden latency matrix. The selection heuristic minimizes an objective function that varies as a function of the desired internal representation. For our example machine, if the objective is to minimize the number of synthesized resources, we must select both resources 0′ and 1′, since resource 0′ is the only resource covering 0 ∈ F_{A,A} and resource 1′ is the only resource covering 3 ∈ F_{B,B}. However, if the objective function is to minimize the number of resource usages, we may also remove the second or third usage of B in resource 1′, shown in Figure 2.1c, since the three remaining usages of B are sufficient to generate all the forbidden latencies in F_{B,B}.

Comparing Figure 2.1a to Figure 2.1d, we can appreciate the benefit of reducing the reservation tables of a target machine. First, the reduced machine description reduces the number of resources from 5 to 2, thus potentially decreasing the memory requirements needed to store the reserved resources of a schedule. Second, the number of resource usages decreases from 3 to 1 for operation A, and from 8 to 4 for operation B. If detecting resource contentions is linear in the number of usages, the reduced machine description results in significantly faster queries. We refine this view in Section 2.6.

2.3 Building Generating Sets of Maximal Resources

In this section, we present an algorithm that constructs the generating set of maximal resources, a set that contains all the maximal resources of a target machine.

[Figure 2.2: Three situations when adding an elementary pair to a resource: (a) fully compatible, the pair's usages are added to the updated resource; (b) partially compatible, a new resource is added containing the pair and the compatible usages; (c) partially compatible but with no other usages (called "incompatible"), no update and no new resource.]

The algorithm builds the maximal resources incrementally, adding usages to current resources and creating new resources when appropriate. It is an efficient algorithm that does not backtrack; however, it may produce some submaximal resources in addition to all the maximal resources. A mechanism to remove submaximal resources as well as redundant maximal resources is discussed in Section 2.4.

We do not consider here the forbidden latencies directly but employ elementary pairs of usages that generate them. We define the elementary pair associated with forbidden latency f ∈ F_{X,Y} as a usage by operation X in cycle 0 and a usage by operation Y in cycle f. We also define a compatibility relation between an elementary pair and the usage of a resource. Elementary pair p, with usages u_0 and u_1, is compatible with a usage u in resource q if the (nonnegative) forbidden latencies generated by u, u_0 and by u, u_1 are both in the forbidden latency matrix.

Note that any resource with n usages can be constructed from n − 1 elementary pairs, namely by (a) shifting all its usages by some constant so that its first usage occurs in column 0, (b) choosing one usage in column 0 and constructing a set of elementary pairs consisting of this usage together with each other usage in the resource, and (c) placing all these pairs, which are known to be compatible since they exist together in the given resource, together in one resource which is then the same as the given resource.
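The compatibility relation reduces to membership tests in the forbidden latency matrix. A minimal sketch follows, assuming usages are encoded as (operation, cycle) pairs and the matrix as a dictionary of sets; the nonnegative latencies encoded for the running example machine are those listed in Section 2.2:

    def latency_ok(u, v, F):
        """True if placing usages u and v on the same resource generates only
        a latency already forbidden in F; u and v are (op, cycle) pairs."""
        (ou, cu), (ov, cv) = u, v
        # u starts (cv - cu) cycles after v when they collide; check the
        # nonnegative direction, using f in F[X,Y] <=> -f in F[Y,X]
        if cv - cu >= 0:
            return (cv - cu) in F.get((ou, ov), set())
        return (cu - cv) in F.get((ov, ou), set())

    def compatible(pair, usage, F):
        u0, u1 = pair
        return latency_ok(usage, u0, F) and latency_ok(usage, u1, F)

    # Nonnegative forbidden latencies of the running example machine:
    F = {("A", "A"): {0}, ("B", "B"): {0, 1, 2, 3}, ("B", "A"): {1}}
    pair = (("B", 0), ("B", 1))           # elementary pair for 1 in F[B,B]
    print(compatible(pair, ("B", 2), F))  # True: generates 2 and 1 in F[B,B]
    print(compatible(pair, ("A", 1), F))  # False: would forbid latency 0 between A and B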

Algorithm 2.1 (Building Generating Sets of Maximal Resources). The first step assigns the initial generating set to the empty set and builds a list of elementary pairs associated with the forbidden latencies of the target machine. We exclude here the elementary pairs associated with the negative forbidden latencies (f < 0) since they are redundant (i.e. f ∈ F_{X,Y} ⟺ −f ∈ F_{Y,X}). We also exclude the elementary pairs associated with the 0 self-contention latencies (0 ∈ F_{X,X}), which are processed as a simple special case at the end of the algorithm.

The second step attempts to add the first elementary pair on the list to each of the resources of the current generating set in turn. One of two cases will occur when attempting to add elementary pair p to resource q:

• Rule 1. Elementary pair p is fully compatible with resource q, i.e. with all usages in q. In this case, we add the usages of p to resource q. Rule 1 is illustrated in Figure 2.2a for the elementary pair associated with the forbidden latency 3 ∈ F_{X,Y}.

• Rule 2. Elementary pair p is partially compatible with resource q, i.e. it is incompatible with at least one usage in q. We may not simply add the usages of p to resource q, as this would generate some forbidden latencies not present in the forbidden latency matrix. Instead, we leave resource q in the current generating set unchanged and consider adding a new resource, consisting of the usages of p and all the usages of resource q that are compatible with p. If this new resource is not simply p itself with no other usages, then it is added to the current generating set; otherwise it is discarded. If the new resource is discarded, we say that p is incompatible with q. Rule 2 is depicted in Figures 2.2b and 2.2c.

Rule 3. After applying Rule 1 or Rule 2, as appropriate, for p with each resource in the current set, and placing the corresponding updated and new resources in the current set, the second step adds elementary pair p itself as a new resource if the two usages of p are not yet found together in any single resource in the set.

Elementary pair p is then removed from the list of elementary pairs and the second step of the algorithm is executed repeatedly until the elementary pair list is empty. The third step is detailed in Rule 4 below.

Rule 4. For each operation X (if any) that has 0 ∈ F_{X,X} as its only forbidden latency, add a new maximal resource that consists of one usage, by operation X in cycle 0. All other operations Y must be part of at least one elementary pair, and 0 ∈ F_{Y,Y} was forbidden automatically when the first elementary pair with a Y usage was processed.
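The following compact sketch implements Rules 1 through 3 for a given list of elementary pairs; the pair encoding and the bookkeeping are simplifications of our own, and the Rule 4 single-usage resources are left to the caller. On the example machine's latencies it reproduces the two maximal resources of Figure 2.1c:

    def build_generating_set(F, pairs):
        """F[(X, Y)]: forbidden latencies; pairs: elementary pairs in
        processing order, each a pair of (op, cycle) usages."""
        def ok(u, v):  # is the latency generated by u and v forbidden?
            (ou, cu), (ov, cv) = u, v
            if cv - cu >= 0:
                return (cv - cu) in F.get((ou, ov), set())
            return (cu - cv) in F.get((ov, ou), set())

        resources = []
        for u0, u1 in pairs:
            together = False
            for q in list(resources):        # snapshot: resources added for
                if all(ok(u, u0) and ok(u, u1) for u in q):  # p are not revisited
                    for u in (u0, u1):       # Rule 1: fully compatible
                        if u not in q:
                            q.append(u)
                    together = True
                else:                        # Rule 2: keep the compatible subset
                    comp = [u for u in q if ok(u, u0) and ok(u, u1)]
                    new_q = list(dict.fromkeys(comp + [u0, u1]))
                    if set(new_q) != {u0, u1}:  # discard if merely p itself
                        resources.append(new_q)
                        together = True
            if not together:                 # Rule 3: the pair becomes a resource
                resources.append([u0, u1])
        return resources

    F = {("A", "A"): {0}, ("B", "B"): {0, 1, 2, 3}, ("B", "A"): {1}}
    pairs = [(("B", 0), ("A", 1)),  # 1 in F[B,A]
             (("B", 0), ("B", 1)),  # 1 in F[B,B]
             (("B", 0), ("B", 2)),  # 2 in F[B,B]
             (("B", 0), ("B", 3))]  # 3 in F[B,B]
    print(build_generating_set(F, pairs))
    # [[('B', 0), ('A', 1)], [('B', 0), ('B', 1), ('B', 2), ('B', 3)]]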

Theorem 2.1. Building the generating set of maximal resources as described in Algorithm 2.1 produces resources that forbid only those latencies that are forbidden in the target machine. Furthermore, the final generating set includes all maximal resources of the target machine.

Proof. Rules 1, 2, 3, and 4 never place a usage in a resource unless it is compatible with each other usage in that resource, i.e. no resource in the current (and hence final) generating set forbids any latency not forbidden in the target machine.

We prove the second part of Theorem 2.1 by contradiction. Suppose there is a maximal resource that is not in the generating set. We shift its resource usages so that its earliest usage occurs in cycle 0 and call that maximal resource q.

Case 1. If q has a single usage, by operation X, either Rule 4 applies and thus q is present in the final generating set, or Rule 4 does not apply and thus q is not a maximal resource since at least one resource (with two or more usages) in the final generating set has a usage associated with operation X, which contradicts the original assumption.

Case 2. If q has n usages (n > 1), u_1 to u_n, with u_1 in cycle 0, we refer to the elementary pairs containing usage u_1 and each of the other n − 1 usages as p_2 to p_n. Without loss of generality, we may assume that the elementary pairs p_2 to p_n are numbered in the order in which they are processed by Algorithm 2.1.


After p_2 is processed (by Rules 1, 2, and 3), its corresponding usages u_1 and u_2 are present in at least one resource, q_{12}, of the current generating set. Other elementary pairs are then processed, possibly adding usages to q_{12}, but never removing any. Eventually, the algorithm will process p_3. From resource q, we know that the usages of p_2 are both compatible with p_3. Hence Rule 1 or 2 will result in a resource q_{123} containing usages u_1, u_2, u_3, and possibly others; Rule 3 will not apply. Repeating this process with all the remaining elementary pairs, including p_4 through p_n, we obtain resource q_{12...n} containing all usages u_1 through u_n, and possibly others. Thus the set of usages in resource q is a subset of the usages in q_{12...n}, which is in the final generating set. Hence either q ⊂ q_{12...n} and q is not maximal, or q = q_{12...n} and q is present in the final generating set, which contradicts the initial assumption. □

[Figure 2.3: Building the generating set for our example machine: (a) process 1 ∈ F_{B,A}, Rule 3 creates resource 0′ with the elementary pair; (b) process 1 ∈ F_{B,B}, incompatible with resource 0′, Rule 3 creates resource 1′ with the elementary pair; (c) process 2 ∈ F_{B,B}, incompatible with resource 0′ but fully compatible with resource 1′, Rule 1 adds the elementary pair; (d) process 3 ∈ F_{B,B}, incompatible with resource 0′ but fully compatible with resource 1′, Rule 1 adds the elementary pair.]

Figure 2.3 illustrates the algorithm, step by step, for our example machine. The algorithm processes the four nonnegative forbidden latencies (excluding the 0 self-contention latencies) 1 ∈ F_{B,A}, 1 ∈ F_{B,B}, 2 ∈ F_{B,B}, and 3 ∈ F_{B,B}, in that order. The 0 self-contention latencies are included automatically without special single-usage resources in this example. The generating sets are shown at each step, in Figures 2.3a, 2.3b, 2.3c, and 2.3d, respectively. The rule applied to each resource is also indicated to the right of each resource.

2.4 Selecting Synthesized Resources and Resource Usages

Once the generating set of maximal resources has been computed, we select a subset of these resources and their usages that covers all the forbidden latencies in the forbidden latency matrix. The selection heuristic attempts to minimize an objective function that varies as a function of the desired internal representation for partial schedules. In this chapter, we consider the two following internal representations.

Discrete-representation. This representation uses a reserved table with one row per resource and one column per schedule cycle. Each entry contains a flag indicating whether the corresponding resource has been reserved by an operation in the current partial schedule. Entries may contain additional fields, such as a field identifying which operation is reserving the corresponding resource, as used in the Iterative Modulo Scheduler algorithm [78], or a field identifying the predicate under which the resource is reserved, as proposed in the Enhanced Modulo Scheduling scheme [94]. Because the number of entries tested to detect resource contentions is proportional to the number of resource usages over all reduced reservation tables, the primary objective of the selection heuristic is to minimize the number of resource usages in the reduced machine description. This objective function is referred to as res-uses.
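A minimal sketch of this representation follows; the encoding of a reservation table as a list of (resource, cycle) usages is an illustrative choice, not the compiler's data structure:

    class DiscreteReservedTable:
        def __init__(self, num_resources, horizon):
            # one flag per (resource, cycle) of the partial schedule
            self.flags = [[False] * horizon for _ in range(num_resources)]

        def check(self, usages, j):
            # every usage, offset by j cycles, must land on a free entry
            return all(not self.flags[r][c + j] for r, c in usages)

        def assign(self, usages, j):
            for r, c in usages:
                self.flags[r][c + j] = True

    # Reduced reservation table of operation B (Figure 2.1d): one usage of
    # resource 0' in cycle 0 and usages of resource 1' in cycles 0, 1, and 3.
    op_B = [(0, 0), (1, 0), (1, 1), (1, 3)]
    table = DiscreteReservedTable(num_resources=2, horizon=16)
    if table.check(op_B, j=2):
        table.assign(op_B, j=2)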

Bitvector-representation. This representation extracts the flag bits of the discrete representation and packs them into one bitvector per schedule cycle (and reduced reservation tables are represented likewise). If k bitvectors can be packed per memory word, the number of words tested to detect resource contentions is reduced, as are the memory requirements for storing the reserved table. The primary objective of the selection heuristic is now to minimize the number of words that need to be tested, i.e. the number of nonempty groups of k consecutive cycles in the reduced reservation tables. A secondary objective is to maximize the number of resource usages in these nonempty words, as more resource usages per word permit faster (early out) detection of resource contentions. This objective function is referred to as k-bitvector-uses, where k is the number of bitvectors packed in a single memory word.
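The packing itself is simple; the sketch below assumes 15-bit vectors and k = 4 (the 64-bit word Cydra 5 figures quoted in Section 2.5) and ignores the alignment of an offset reservation table across word boundaries, which a real implementation must handle:

    BITS = 15   # one bit per synthesized resource
    K = 4       # bitvectors packed per 64-bit memory word

    def pack(cycle_vectors):
        """Pack per-cycle bitvectors (ints) into words of K cycles each."""
        words = []
        for base in range(0, len(cycle_vectors), K):
            word = 0
            for off, vec in enumerate(cycle_vectors[base:base + K]):
                word |= vec << (off * BITS)
            words.append(word)
        return words

    def conflicts(reserved_words, reservation_words):
        # one "and" per word tests K consecutive cycles at a time
        return any(r & s for r, s in zip(reserved_words, reservation_words))

    reserved = pack([0b000000000000001, 0, 0, 0, 0, 0, 0, 0])
    request  = pack([0b000000000000011, 0, 0, 0, 0, 0, 0, 0])
    print(conflicts(reserved, request))  # True: resource bit 0 clashes in cycle 0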

Selection Heuristic. Although integer programming can solve these minimum-cover problems, we have found a fast and effective heuristic.

The first step of the heuristic prunes the synthetic resources of the generating set by successively removing each resource that produces a set of forbidden latencies that is generated or covered by one of the remaining resources. At this point all nonmaximal resources, if any, and possibly some maximal resources (e.g. the mirror image of a maximal resource that contains only one operation type is also a maximal resource, but they are not both needed) have been eliminated. The resulting set of synthetic resources, referred to as the pruned generating set, is used in the remainder of this algorithm.

The second step of the heuristic computes for each selected forbidden latency a list containing all usage pairs in the pruned generating set that generate that forbidden latency. We build these lists for each nonnegative forbidden latency, excluding the 0 self-contention latencies, which are ignored in Steps 2 and 3 and are processed as a special case in Step 4 of the algorithm. The resulting lists provide two pieces of useful information. First, each list precisely enumerates every usage pair that generates the forbidden latency with which it is associated. Thus, the selection problem can be simply reformulated as a covering problem which selects at least one usage pair from each list. Second, the cardinality of each list indicates whether a forbidden latency is "rare" or "common" in the pruned generating set, i.e. whether there are only a few usage pairs, or many, that generate its corresponding forbidden latency.

The third step of the heuristic selects synthetic resources and usage pairs from the pruned generating set so that at least one usage pair in each list is selected and each selected pair locally minimizes the desired objective function. We process the lists in increasing order of cardinality, since common forbidden latencies are more likely than rare forbidden latencies to be generated as a byproduct of selecting multiple usage pairs within a single resource, i.e. without being explicitly selected. When processing a list, the heuristic selects from the list the usage pair p that results in the maximum "benefit," i.e. the usage pair that covers the largest number of forbidden latencies not yet covered by the currently selected resource usages. In case of ties, the heuristic selects the usage pair whose newly covered forbidden latencies have a larger sum. Once a usage pair is selected, its corresponding resource and the selected usages (as discussed below) are marked as selected.

When using the discrete-representation, the heuristic simply selects the two usages of p, since the objective is to minimize the total number of usages. When using the bitvector-representation, however, the heuristic also marks as selected each usage that (a) belongs to a currently selected resource, (b) occurs within the next k cycles of one of the usages of p, and (c) corresponds to the same operation as that usage in p. These additional usages can be safely added since they are permitted by the maximal resource in which they were found, and they are desirable since they permit faster detection of resource contentions. Depending on the word alignment in memory of the reservation and reserved tables, these additional usages do not impact, or increase by at most one, the number of memory words that need to be tested. If the cumulative effect of such increases is felt to be a potential problem, the alignment of these tables can be controlled by, for example, modifying condition (b) to require an added usage to be in the same k-cycle memory word.

This third step of the selection heuristic then processes the next list with no already selected usage pair in it, with priority given to the list with the smallest cardinality, and continues until all lists have at least one selected usage pair.

Finally, the fourth step of the heuristic creates a new resource with a unique usage in cycle 0 by operation X for each operation X (if any) that has 0 ∈ F_{X,X} as its only forbidden latency. Note that all other 0 self-contention latencies have already been covered as a byproduct of the usages selected in Step 3.

We first illustrate the selection heuristic for a discrete representation using the generating set shown in Figure 2.1c (also shown in Figure 2.4a). In Step 1, the heuristic determines that both synthesized resources 0′ and 1′ must be included in the pruned generating set, since resource 0′ is the only resource generating 1 ∈ F_{B,A} and resource 1′ is the only resource generating 3 ∈ F_{B,B}. The resulting pruned generating set is shown in Figure 2.4a.

[Figure 2.4: Selection heuristics for our example machine: (a) the pruned generating set (Step 1); (b) the lists of usage pairs (Step 2); (c) selection of resources and resource usages (Step 3, discrete representation), processing 1 ∈ F_{B,A}, 3 ∈ F_{B,B}, and 2 ∈ F_{B,B}; (d) selection of resources and resource usages (Step 3, bitvector representation, 4 cycles per word), processing 1 ∈ F_{B,A} and 3 ∈ F_{B,B}.]

In Step 2, the heuristic constructs the lists of usage pairs associated with the four nonnegative forbidden latencies of Figure 2.1b, excluding the 0 self-contention forbidden latencies. As shown in Figure 2.4b, there are two lists containing one usage pair (associated with the forbidden latencies 1 ∈ F_{B,A} and 3 ∈ F_{B,B}), one list with two usage pairs (associated with 2 ∈ F_{B,B}), and finally one list with three usage pairs (associated with 1 ∈ F_{B,B}).

In Step 3, the heuristic selects resources and resource usages from the two synthesized resources of the pruned generating set, by processing the lists in increasing order of cardinality. Using the lists shown in Figure 2.4b in top-down order, the heuristic first processes the list associated with 1 ∈ F_{B,A}. It selects the only usage pair in the list, marking as selected resource 0′ along with its usages in cycle 1 by operation A and in cycle 0 by operation B, as shown in Figure 2.4-c1. The heuristic then processes the second list with one usage pair, corresponding to 3 ∈ F_{B,B}, resulting in the selection of resource 1′ along with its usages in cycles 0 and 3 by operation B, as depicted in Figure 2.4-c2. The algorithm then proceeds to the list with two usage pairs, corresponding to 2 ∈ F_{B,B}. It evaluates the benefits of including each of the two usage pairs of resource 1′ by operation B, with usages in cycles 0 and 2 versus in cycles 1 and 3. Since both usage pairs cover the same two new nonnegative forbidden latencies (1 ∈ F_{B,B} and 2 ∈ F_{B,B}), the heuristic arbitrarily selects the second usage pair, marking the usage of resource 1′ in cycle 1 by operation B as selected, resulting in the reservation tables shown in Figure 2.4-c3. The fourth list is not processed since its first pair of usages has already been selected. The third step is thus now complete. Step 4 does not apply here, as there is no operation with the 0 self-contention latency as its only forbidden latency. The resulting reservation tables in Figure 2.4-c3 are identical to the ones shown in Figure 2.1d.

If the selection heuristic is used for a bitvector representation with 4 bitvectors per word, the third step of the algorithm would proceed as follows. Using the same list ordering as in the above example, the heuristic first processes the list associated with 1 ∈ F_{B,A}. It marks as selected resource 0′ along with its usage in cycle 1 by operation A and in cycle 0 by operation B, as shown in Figure 2.4-d1. As there are no additional usages to add to this resource, it then proceeds with the list associated with 3 ∈ F_{B,B}, resulting in the selection of resource 1′ along with its usages in cycles 0 and 3 by operation B. Unlike with the discrete representation, the heuristic then also marks as selected every usage of resource 1′ by operation B in the pruned generating set that lies in the cycle intervals 0 to 0 + k = 4 and 3 to 3 + k = 7. It thus selects the additional usages of resource 1′ by operation B in cycles 1 and 2, as depicted in Figure 2.4-d2. Since at least one usage pair in each of the four lists has now been selected, and Step 4 does not apply, the algorithm completes. Note that the reservation tables for the bitvector representation have one more usage than the ones for the discrete representation, namely a usage of 1′ by operation B in cycle 2.

Note that this selection algorithm is heuristic, and may not produce a globally optimum solution since we cannot guarantee that pruning in Step 1 does not eliminate all globally optimum solutions, and the greedy selection procedure used in Step 3, while locally optimum, does not guarantee global optimality.
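A condensed sketch of Step 3's greedy covering loop follows; the usage pairs are opaque tokens here, and both the tie-breaking on the sum of newly covered latencies and the bitvector-specific extra usages are omitted:

    def select_usage_pairs(lists):
        """lists: forbidden latency -> set of candidate usage pairs."""
        selected = set()
        # rare latencies (short lists) first; common ones are often covered
        # as a byproduct and never processed explicitly
        for lat in sorted(lists, key=lambda l: len(lists[l])):
            if any(p in selected for p in lists[lat]):
                continue
            def benefit(p):   # number of latencies newly covered by choosing p
                return sum(1 for l2 in lists
                           if p in lists[l2]
                           and not any(q in selected for q in lists[l2]))
            selected.add(max(lists[lat], key=benefit))
        return selected

    # Toy instance shaped like Figure 2.4b (the pair names are made up):
    lists = {"1 in F[B,A]": {"p0"},
             "3 in F[B,B]": {"p1"},
             "2 in F[B,B]": {"p2", "p3"},
             "1 in F[B,B]": {"p3", "p4", "p5"}}
    print(sorted(select_usage_pairs(lists)))  # ['p0', 'p1', 'p3']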

2.5 Reduced Machine Examples

In this section, we present experimental results for three machines: the DEC Alpha 21064, the MIPS R3000/R3010, and the Cydra 5. Each reduced machine description generates exactly the same forbidden latency matrix as the original machine description. For the Cydra 5 machine descriptions, we also verified that precisely the same schedules were produced regardless of the machine description used by the compiler when scheduling a benchmark suite of 1327 loops obtained from the Perfect Club [13], SPEC-89 [91], and the Livermore Fortran Kernels [65].

For each machine and internal representation, we present three data points: the total number of resources in the machine description, the average number of resource usages per operation class in the machine description, and the average number of nonempty bitvector words that need to be tested to answer a query, i.e. the number of nonempty groups of k consecutive cycles in the reservation tables. The third metric, referred to as word usage, is averaged over all operation classes and all possible alignments between the bitvectors encoding the reserved table and the reservation table. In this section we assume that each operation class has the same frequency, which tends to yield pessimistic average usages since simple operations are usually more frequent than complex operations. We also assume that the performance of the contention query module is proportional to the average resource usage or word usage per operation, depending on the internal representation. A more detailed performance analysis of the contention query module for the Cydra 5 machine is presented in Section 2.7.

As a proof of concept, we investigated our technique on the Cydra 5 machine [11], which has the most complex resource requirements of the three machines. The machine configuration investigated here has 7 functional units: 2 memory ports, 2 address generation units, 1 FP adder, 1 FP multiplier, and 1 branch unit. The original machine description used by the Cydra 5 Fortran77 compiler [27] was manually optimized, i.e. some physical resources were eliminated from the machine description as they did not introduce any new forbidden latencies [84]. This description models 56 resources and 152 distinct patterns of resource usages, resulting in 52 distinct operation classes with 10223 forbidden latencies. It is significantly larger than the machine descriptions used in previous studies, e.g. it has 3.5 times more operation classes and 2.4 times more forbidden latencies than the MIPS R3000/R3010 machine description used in [76]. Our algorithm reduced this original Cydra 5 machine description in less than 11 minutes on a Sun Sparc-20 workstation.

Representation:                 original  discrete     bitvectors
Objective function                        res-uses  1-bitvector-  2-bitvector-   4-bitvector-
minimizing:                                         uses          uses (32 bits) uses (64 bits)
number of resources                56        15         15            15             15
resource usages / operation       18.5       8.5        9.0          10.2           11.6
word usages / operation           13.5       6.8        6.3           4.8            3.4

Table 2.1: Results for the Cydra 5: 52 operation classes, 10223 forbidden latencies (all ≤ 41).

Table 2.1 presents data for four reduced machine descriptions and the original description of the Cydra 5. The second column corresponds to the machine description reduced for discrete representations, i.e. it attempts to minimize res-uses. The three remaining columns correspond to machine descriptions reduced for the bitvector representation, i.e. they attempt to minimize k-bitvector-uses and secondarily maximize res-uses. Underlined numbers correspond to the entries that the respective objective functions seek to minimize.

Compared to the original machine description, the reduced machine descriptions decrease the number of modeled resources by a factor of 3.7 (from 56 to 15). For the discrete representation, the average resource usage is reduced by a factor of 2.2 (from 18.5 to 8.5). The reduced machine description uses only 26.8% of the data storage used by the original machine description to store the reserved table (from 56 to 15 entries per cycle in the reserved table). Similarly, for the 64 bit word bitvector representation, the average word usage is decreased by a factor of 4.0 (from 13.5 to 3.4). Only 25.0% of the data storage used by the original machine description is required to store the reserved table (4 cycles of 15 bits each versus 1 cycle of 56 bits per word). Note the successive increases in the number of resource usages when reducing for 1, 2, and 4 cycles per word. These increases permit faster detection of resource contentions (fewer average words tested per query) without increasing the memory space required for state storage.

Representation:                 original  discrete     bitvectors
Objective function                        res-uses  1-bitvector-  3-bitvector-   7-bitvector-
minimizing:                                         uses          uses (32 bits) uses (64 bits)
number of resources                39         9          9             9              9
resource usages / operation       11.0       3.6        3.6           4.5            5.3
word usages / operation            8.7       3.2        3.2           2.3            1.7

Table 2.2: Results for a subset of the Cydra 5: 12 operation classes, 166 forbidden latencies (all ≤ 21).

Table 2.2 presents similar data for the subset of operations actually used in the 1327 loop benchmark compiled for the Cydra 5. Comparing the original description to the reduction for a 64 bit word bitvector representation, the reduced machine description decreases the average word usage by a factor of 5.1 (from 8.7 to 1.7). The reservation tables associated with the machine descriptions of the original model, the discrete reduction, and the 64 bit word bitvector reduction are shown, respectively, in Figures 2.5a, 2.5b, and 2.5c.

Table 2.3 shows the results of the reductions for the DEC Alpha 21064 [28] using the machine description presented by Bala and Rubin [9]. Comparing the original description to the specific reduction for a 64 bit word bitvector representation, the average word usage is decreased by a factor of 6.9. This reduced representation detects all resource contention by testing on average two 64 bit words even though the largest forbidden latency is 58 cycles.

Representation:                 original  discrete     bitvectors
Objective function                        res-uses  1-bitvector-  6-bitvector-   12-bitvector-
minimizing:                                         uses          uses (32 bits) uses (64 bits)
number of resources                 8         5          5             5              5
resource usages / operation       15.0       5.9        5.9           9.7           10.3
word usages / operation           13.7       5.2        5.2           2.9            2.0

Table 2.3: Results for the DEC Alpha 21064: 10 operation classes, 279 forbidden latencies (all ≤ 58).

Bala and Rubin have presented factored finite-state automata for this processor with (237 + 232) states in the two forward automata and (237 + 231) states in the two reverse automata. By encoding each factored state in 8 bits, and assuming that each state is stored as opposed to recomputed, 64 bits of memory per schedule cycle are needed to cache the 8 states per cycle of the factored forward and reverse automata for this dual issue microprocessor, compared to 5 bits per schedule cycle for the bitvector specific reductions. The reservation tables associated with the machine descriptions of the original model, the discrete reduction, and the 64 bit word bitvector reduction are shown, respectively, in Figures 2.6a, 2.6b, and 2.6c.

Representation:                 original  discrete     bitvectors
Objective function                        res-uses  1-bitvector-  4-bitvector-   9-bitvector-
minimizing:                                         uses          uses (32 bits) uses (64 bits)
number of resources                22         7          7             7              7
resource usages / operation       17.3       7.3        8.1           8.3            8.5
word usages / operation           11.0       5.6        5.6           2.4            1.6

Table 2.4: Results for the MIPS R3000/R3010: 15 operation classes, 428 forbidden latencies (all ≤ 34).


[Figure 2.5: Reservation tables for a subset of the Cydra 5: (a) original machine description (39 resources, 132 resource usages); (b) discrete-representation machine description (9 resources, 43 resource usages); (c) bitvector-representation machine description with 7 cycles per 64 bit memory word (9 resources, 63 resource usages).]

[Figure 2.6: Reservation tables for the DEC Alpha 21064: (a) original machine description (8 resources, 150 resource usages); (b) discrete-representation machine description (5 resources, 59 resource usages); (c) bitvector-representation machine description with 12 cycles per 64 bit memory word (5 resources, 103 resource usages).]

[Figure 2.7: Reservation tables for the MIPS R3000/R3010 (part 1): (a) original machine description (22 resources, 260 resource usages); (b) discrete-representation machine description (7 resources, 110 resource usages); (c) bitvector-representation machine description with 9 cycles per 64 bit memory word (7 resources, 128 resource usages).]

[Figure 2.8: Reservation tables for the MIPS R3000/R3010 (part 2): (a) original machine description (22 resources, 260 resource usages); (b) discrete-representation machine description (7 resources, 110 resource usages); (c) bitvector-representation machine description with 9 cycles per 64 bit memory word (7 resources, 128 resource usages).]

Table 2.4 shows the results of the reductions for the MIPS R3000/R3010 [53] using the machine description presented by Proebsting and Fraser [76]. Comparing the original description to the 64 bit word bitvector reduction, the reduced machine description decreases the average word usage by a factor of 6.9. Proebsting and Fraser reported a (forward-only) finite-state automaton for this processor with 6175 states [76]. The reservation tables associated with the machine descriptions of the original model, the discrete reduction, and the 64 bit word bitvector reduction are shown, respectively, in Figures 2.7a/2.8a, 2.7b/2.8b, and 2.7c/2.8c.

2.6 Contention Query Module

To evaluate the impact of reduced machine descriptions on the performance of contention queries, we implemented a contention query module for the two representations described in Section 2.4. The module supports four basic functions: check, assign, free, and assign&free for the placement of a given operation X in cycle j.

Check. This function determines whether operation X can be scheduled in cycle j in the current partial schedule without generating resource contentions. For the discrete representation, this query checks each usage in the reservation table for operation X (offset by j cycles) against the corresponding entries in the reserved table. When a resource contention is detected, the function aborts its search; otherwise, every usage is checked before the query completes successfully. The number of usages tested by the function is thus bounded by the total number of usages in the reservation table for operation X.

For the bitvector representation, this function simply amounts to "anding" each bitvector in the reservation table with the corresponding bitvector in the reserved table and checking for a 0 result. With k bitvectors packed per memory word, contentions for k consecutive cycles are effectively detected by performing one "and" operation. The function aborts as soon as contentions are detected; otherwise, every word in the reservation table is checked before the query completes successfully. The number of words tested by the check function is thus bounded by the total number of words in the reservation table for operation X.

Assign. This function reserves each of the resources required by operation X when scheduled in cycle j. When a discrete representation is used, this function simply sets the flag of each reserved table entry that corresponds to a usage in the reservation table for operation X (offset by j cycles). When a bitvector representation is used, the words encoding the bitvectors of the reservation table for operation X (offset by j cycles) are "ored" into the corresponding words encoding the bitvectors of the reserved table.

Free. This function releases the resources reserved by an operation X scheduled in cycle j. When a discrete representation is used, this function simply resets the flag of each reserved table entry corresponding to a usage in the reservation table for operation X (offset by j cycles). When a bitvector representation is used, this function simply "ands" the complement of the words encoding the bitvectors of the reservation table for operation X (offset by j cycles) into the words encoding the bitvectors of the reserved table.
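For the bitvector representation, each of the three functions above reduces to one logical operation per word. A minimal sketch, with the word lists as an illustrative data layout and alignment handling again omitted:

    def check(reserved, resv):
        # "and" each word pair; any nonzero result is a contention
        return all((r & s) == 0 for r, s in zip(reserved, resv))

    def assign(reserved, resv):
        for i, s in enumerate(resv):
            reserved[i] |= s        # "or" the reservation words in

    def free(reserved, resv):
        for i, s in enumerate(resv):
            reserved[i] &= ~s       # "and" the complement to release

    reserved = [0, 0]
    request = [0b0110, 0]
    if check(reserved, request):
        assign(reserved, request)
    free(reserved, request)         # reserved is all zero again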

Assign&free. Unlike the assign function, the assign&free function first ensures that the resources that are required by operation X when scheduled in cycle j are available. If any of the resources are already reserved by other operations, these operations (and possibly others [78][83]) will be unscheduled and their resources released. To identify which operations to unschedule, if any, the assign&free function uses an additional field per entry in the reserved table. This field, referred to as op-id, associates each cycle for each resource with the id of the particular operation, if any, that currently uses it in the partial schedule.

The choice of using the assign versus the assign&free function depends on the scheduling algorithm used by the compiler. If, within a unit of compilation, a scheduler is guaranteed to assign operations to contention-free cycles exclusively, then the more efficient assign function should always be used in this unit. Otherwise, because of the additional bookkeeping required by the op-id fields, the assign&free function must be used to assign every operation, regardless of whether the assigned operation conflicts with others or not. The choice of the assign versus the assign&free function does not affect the check and the free functions, since, in our implementation, they do not use or modify the op-id fields. Note that this can result in stale information in the op-id fields (i.e. the free function does not reset the op-id fields after unscheduling an operation), but it has no consequences since we only use the op-id fields if the resource is currently assigned.

For the discrete representation, the assign&free function iterates over each reserved table entry that corresponds to a resource usage in the reservation table for operation X (offset by j cycles). For each entry, it verifies that no operation currently uses that resource; otherwise, it identifies the conflicting operation by using the corresponding op-id field, finds the cycle in which the conflicting operation was scheduled, and invokes the free function to unschedule the conflicting operation. The assign&free function then assigns operation X as scheduled in cycle j by reserving the required resources in the appropriate cycles and assigning their op-id fields to X in those cycles.

For the bitvector representation, we use a strategy that does not update the op-id fields unless required. In the initial mode, referred to as optimistic mode, the op-id fields are ignored and the resources are checked simply by selecting each bitvector word in turn from the X reservation table, "anding" it to the corresponding reserved table word, and testing for a nonzero result. If no conflict is detected, the resources required by operation X in cycle j are reserved by "oring" the bitvector words of the X reservation table into the corresponding bitvector words of the reserved table. However, if a resource conflict is detected (i.e. an "and" result is nonzero), some operations must first be unscheduled because of resource contentions. Since the op-id fields are not updated in optimistic mode, the operations in the partial schedule are scanned and the op-id fields corresponding to their resource usages are reconstructed when the first operation needs to be unscheduled. The op-id fields are then used to identify the conflicting operations, which are then unscheduled using the free function, after which operation X is simply assigned in cycle j as performed by the assign function with the bitvector representation.

Experimental evidence shows that the scheduler algorithm used in our study rarely unschedules operations because of resource conflicts; however, once it does, it tends to unschedule a large number of operations in that loop. Therefore, it is advantageous to ignore the op-id fields initially in the hope that no operations will need to be unscheduled. However, it is beneficial to keep them up-to-date once they have been reconstructed for a given loop. This bookkeeping is performed in update mode. In update mode, the assign&free function for the bitvector representation updates the bitvectors of the reserved tables, as performed by the assign function with the bitvector representation, and also assigns the op-id fields, as described for the assign&free function with the discrete representation. While this dual-mode strategy worked well in our experiments, different strategies may work better for other scheduling algorithms and machine descriptions.

Also, some scheduler algorithms already know whether the operation that is currently being assigned to a cycle will conflict with some operations in the partial schedule. For example, the Iterative Modulo Scheduler [78] first attempts to schedule an operation in a contention-free cycle; however, if no such cycle is found, it chooses some cycle in which to assign the operation, regardless of the resource conflicts. Thus the assign&free function must always be used. However, when the assignment is known not to be conflict free, this information is conveyed to the assign&free function, thereby performing the necessary updates but eliminating the code that checks for contentions.

Note that the four basic functions iterate over the resource usages when the discrete representation is used, and over the words encoding the bitvectors when the bitvector representation is used. In addition, the assign&free function for the bitvector representation also iterates over the resource usages in update mode, since it updates the op-id fields as well. Note also that the work performed to handle a resource usage is roughly comparable to the work required to handle a word of bitvectors.
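The dual-mode assign&free strategy can be sketched as follows; the per-bit op-id map, the helper names, and the reconstruction scan are illustrative simplifications of the scheme described above, and the reserved words are assumed to stay consistent with the schedule map:

    class BitvectorAssignFree:
        def __init__(self, num_words):
            self.reserved = [0] * num_words
            self.op_id = {}          # (word, bit) -> op, valid in update mode
            self.optimistic = True   # op-id fields ignored until needed

        def assign_and_free(self, resv, op, schedule, unschedule):
            """resv: reservation words of op; schedule: op -> its words."""
            if any(self.reserved[i] & s for i, s in enumerate(resv)):
                if self.optimistic:  # first conflict: rebuild all op-ids
                    self.optimistic = False
                    for other, words in schedule.items():
                        self._record(words, other)
                victims = {self.op_id[i, b]
                           for i, s in enumerate(resv)
                           for b in range(s.bit_length())
                           if (s >> b) & 1 and (self.reserved[i] >> b) & 1}
                for victim in victims:
                    unschedule(victim)   # expected to free the victim's words
            for i, s in enumerate(resv):
                self.reserved[i] |= s
            if not self.optimistic:      # keep op-ids current in update mode
                self._record(resv, op)

        def _record(self, words, op):
            for i, s in enumerate(words):
                for b in range(s.bit_length()):
                    if (s >> b) & 1:
                        self.op_id[i, b] = op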

The contention query module provides an additional function that facilitates the finding of a contention-free operation at a given cycle in the presence of alternatives. Alternative operations were introduced in Section 2.2 as operations that perform an identical task but use different resources, e.g. both the add1 and add2 operations perform an addition but use distinct functional units. Alternative operations may contain more operations than those caused by replicated hardware structures, e.g. a move operation may also be implemented as add 0 or mult 1. The additional function used to support alternative operations is defined as follows.

Check-with-alt. This function determines if operation X, or any of its alternative operations, can be scheduled in cycle j without resource contentions. If so, the function returns one of the contention-free operations; otherwise, it returns an error value. In this chapter, we simply implemented this function by repetitively calling the check function for each of the alternative operations until it succeeds. Other more efficient techniques could be implemented by, for example, factoring resource reservations that are used by all or a subset of the alternative operations and checking each factor at most once.

The check-with-alt function, when compiling for the Cydra 5 [11], is used to determine which of the two address generation units should be used, e.g. adding to an address register can be performed by either the a1add or the a2add operation. Other alternative operations that execute on one of the two address generation units are the a1sub and the a2sub operations as well as the a1brev and the a2brev operations. Moreover, selecting one of two input operands (depending on the value of a third binary input operand) can be implemented by either the integer or floating point pipe, using either the iselect or the fmselect operation, respectively.
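The repetitive-call implementation is the straightforward loop below (a sketch; the check callback and the None error value are illustrative stand-ins for the real query and error code):

    def check_with_alt(alternatives, j, check):
        """Return the first alternative schedulable in cycle j without
        contention, or None if every alternative conflicts."""
        for op in alternatives:
            if check(op, j):
                return op
        return None

    # e.g. choosing one of the two Cydra 5 address generation units
    # (the check callback here is a stand-in for the real query):
    unit = check_with_alt(["a1add", "a2add"], j=3,
                          check=lambda op, j: op == "a2add")
    print(unit)  # a2add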

2.7 Performance of the Contention Query Module

To evaluate the impact of the contention query module and the reduced machine representation, we have selected a state-of-the-art scheduler with low computational complexity that results in high performance code, and is thus likely to be representative of the scheduling algorithms used in future high-performance compilers.

We implemented a scheduler for software pipelined loops using the algorithm developed and described by Rau in [78]. This algorithm, referred to as the Iterative Modulo Scheduler, exploits the instruction level parallelism present in loop iterations by overlapping the execution of consecutive iterations. This modulo scheduler has been designed to produce schedules with near optimal steady-state throughput while efficiently dealing with realistic machine models. Experimental findings presented by Rau [78] show that the algorithm of this scheduler requires the scheduling of only 59% more operations than does acyclic list scheduling while resulting in schedules that are optimal in II for 96.0% of loops in the benchmark suite described below. The key characteristic of the scheduling algorithm is its iterative nature: it schedules operations using a priority function that gives precedence to operations along critical paths and allows prior scheduling decisions to be reversed, unscheduling operations when data dependences are violated or resource contentions occur. The algorithm satisfies the definition of the unrestricted scheduling model since it schedules operations in arbitrary order and may reverse scheduling decisions.

We used a benchmark of loops obtained from the Perfect Club [13], SPEC-89 [91], and the Livermore Fortran Kernels [65] which consists exclusively of innermost loops with no early exits, no procedure calls, and fewer than 30 basic blocks, as compiled by the Cydra 5 Fortran77 compiler [27]. The input to the scheduling algorithms consists of the Fortran77 compiler intermediate representation after load-store elimination, recurrence back-substitution, and IF-conversion. The benchmark suite consists of the 1327 loops successfully modulo scheduled by the Cydra 5 Fortran77 compiler, with 1002 from the Perfect Club, 298 from SPEC, and 27 from the Livermore Fortran Kernels. This benchmark of loops was used in [78] and was provided to us by Rau.

The characteristics of the generated schedules for the 1327 loop benchmark are summarized in Table 2.5. The first two rows indicate the number of operations per loop iteration and the initiation interval of the software pipelined loops. The third row shows the ratio of the initiation interval (II) to the minimum initiation interval (MII), and is a good indication of the quality of the produced schedules. We see that in 95.6% of the loops, our implementation of the Iterative Modulo Scheduler produces a schedule with the minimum initiation interval, i.e. achieving the maximum feasible steady-state throughput. This ratio is within 0.5% of that obtained in [78].

Measurements:                          min    freq   median  average     max
number of operations                  2.00    0.4%   10.00    17.54   161.00
initiation interval (II)              1.00   28.7%    3.00    11.52   165.00
II/MII                                1.00   95.6%    1.00     1.01     1.50
schedule decisions / operation        1.00   78.7%    1.00     1.52     6.00
unscheduled ops / schedule decision   0.00   61.3%    0.00     0.52     6.00
  due to resource contentions         0.00   93.2%    0.00     0.07     4.00
  due to violated data dependences    0.00   66.5%    0.00     0.45     6.00
alternatives / schedule decision      1.00   88.0%    1.00     1.12     2.00
check-with-alt / schedule decision    1.00   51.8%    1.00     4.49    96.00
check / schedule decision             1.00   49.5%    2.00     4.74    96.00
check / check-with-alt                1.00   94.4%    1.00     1.06     2.00

Table 2.5: Characteristics of the 1327 loop benchmark suite when using the Iterative Modulo Scheduler.

A key feature of the Iterative Modulo Scheduler algorithm is that it can reverse a limited number of scheduling decisions. In this chapter, the scheduler may perform up to 6N scheduling decisions, where N is the number of operations in the loop being scheduled. When the scheduler exceeds the allocated budget of scheduling decisions, set to 6 times the number of operations in this study, the scheduling algorithm makes a new attempt with a larger initiation interval. The fourth row of Table 2.5 indicates that in 78.7% of the loops, no scheduling decision was ever reversed. The actual ratio of schedule decisions to the number of operations is 1.52, when averaged over all loops and all scheduling attempts, including 9.6% of the total attempts in which the 6N upper limit was exceeded. The ratio is highly sensitive to the upper limit used by the scheduler; e.g. an upper limit of 2N results in an average ratio of 1.14, including 11.3% of the total attempts in which the 2N upper limit was exceeded.

The contention query module used in this section closely corresponds to the one described in Section 2.6, but adapted to handle the periodicity of the modulo schedules (i.e. using a Modulo Reservation Table [48][73]). Since the scheduling algorithm used in our experiment can schedule an operation even though it may result in resource contentions, we use the assign&free function here to commit each scheduling decision. Row 6 of Table 2.5 indicates that 6.8% of the scheduling decisions result in resource contentions, with an average of 0.07 unscheduled operations per scheduling decision due to resource contentions. As shown in row 7, operations are also unscheduled due to violated dependence constraints. Overall, 13.4% of all unscheduled operations are due to resource contentions, and 86.6% are due to violated data dependences.

Since the machine considered in our experiment has alternative operations, we use the check-with-alt function here to determine a cycle of the schedule in which one of the alternative operations can be scheduled without resource contention. Row 8 of Table 2.5 indicates that 88.0% of the scheduling decisions have only one operation to consider, and the remaining 12.0% have two alternative operations to consider. Row 9 shows that in 51.8% of the cases one call to check-with-alt is sufficient to find a cycle in which one of the alternative operations can be scheduled without resource contention. Otherwise, the scheduler issues two queries in 15.7% of the cases, three in 8.5%, four in 5.5%, five in 3.0%, six in 2.0%, seven to twenty in 10.2%, and up to 96 queries in the remaining 3.3%, with a median of 1 query and an average of 4.49 queries. Because one check-with-alt call may invoke up to one check query for each alternative operation, the number of check calls per scheduling decision is also provided, in row 10 of Table 2.5. The last row shows the effective number of alternative operations checked per check-with-alt call; interestingly, even though 12.0% of the scheduling decisions have two alternative operations, only 5.6% of the check-with-alt calls test both operations.

The performance of each function is first quantified by counting the number of units of work performed by each function, where one work unit handles a single resource usage or a single word of bitvectors in a reservation table. This metric is directly proportional to the work performed in the innermost loop of the contention query module, and is thus a good indicator of the essential computation performed by the module. When computing the total units of work associated with a machine representation, the overhead incurred in the transition from the optimistic mode to the update mode in the assign&free function is also taken into account. The average number of work units per function call is given for each of the functions and machine representations for the complete Cydra 5 machine description. The machine representations are those in Table 2.1.

Representation:           original  discrete     bitvectors                                 freq.
Objective function                  res-uses  1-bitvector-  2-bitvector-   4-bitvector-
minimizing:                                   uses          uses (32 bits) uses (64 bits)
check-with-alt              2.77      2.17       2.01          1.32           1.17          74.7%
assign&free                 5.68      2.14       2.34          2.02           1.88          16.6%
free                        6.48      2.58       2.23          1.58           1.29           8.7%
Weighted sum:               3.59      2.20       2.08          1.46           1.30
Improvement factor:         1.00      1.63       1.73          2.46           2.76

Table 2.6: Performance of the query functions when using the Iterative Modulo Scheduler (average work units per call).

First consider in Table 2.6 the average work units associated with the discrete representations. Although the reduced machine representation in Column 2 of Table 2.6 eliminates much of the redundancy in Column 1, decreasing the average resource usages per operation by a factor of 2.2 (from 18.5 to 8.5 in Table 2.1), the work performed by the check-with-alt function decreases by a factor of only 1.3, from 2.77 to 2.17 units per call. This effect may be attributed to the fact that the redundancy in the original machine description helps in finding resource contention more quickly. However, the redundancy in the original machine description significantly affects the work performed by the assign&free and free functions because both functions, unlike check-with-alt, necessarily iterate over each resource usage. As a result, the work performed by the assign&free and free functions decreases by a factor of 2.7 and 2.5, respectively.

The average work units associated with the bitvector representations and 1, 2, and 4 bitvectors per word are shown in Columns 3, 4, and 5 of Table 2.6, respectively. We can see that increasing the number of bitvectors per word significantly decreases the work performed by the check-with-alt and free functions, each of which iterates exclusively over memory words. The work performed by assign&free also decreases, but more moderately, as it iterates either over words in the optimistic mode or over words and resource usages in the update mode, and also incurs a mode transition overhead. The dual mode strategy is effective at keeping this overhead low, as only 3.0% of the scheduling attempts and 11.7% of the assign&free calls are performed in update mode.

The overall performance in units of work is obtained by multiplying the average work units performed by each function by the relative frequency of scheduler calls to that function and summing these products. In Table 2.6, these frequencies are shown in the rightmost column and the weighted average work units for the five machine descriptions are given in the last two rows. When using a discrete representation, we see that reducing the machine description decreases the weighted average by a factor of 1.63, from 3.59 to 2.20 work units per call. When using a 64 bit word bitvector representation, reducing the machine description decreases the weighted average by a factor of 2.76, from 3.59 to 1.30 work units per call. This weighted average is thus only 30% above the absolute minimum, since the contention query module must handle at least one unit of work for one of the alternative operations to detect, reserve, or free the resources modeled in any finite-resource machine model.

We now consider the actual execution time of the contention query module, for each of the machine representations and query functions, as measured on a Sun UltraSparc-1 workstation. Three versions of the contention query module are used here: one for the discrete representation, one for the bitvector representation with one bitvector per memory word, and one for the bitvector representation with two or more bitvectors per word. The difference between the last two versions is that the code for the bitvector representation with two or more bitvectors per word (1) has more expensive logic to test a single bit in assign&free when conflicting operations must be unscheduled and (2) must detect and handle all possible word alignments of the reservation and reserved table bitvectors. Each code version was carefully optimized, and compiled with the native Sun compiler at its highest optimization level.

Representation:        original   discrete     bitvectors
                                                             (32 bits)     (64 bits)
Objective function        --      res-uses   1-bitvector-  2-bitvector-  4-bitvector-
minimizing:                                      uses          uses          uses       freq.
check-with-alt           2.88       2.49         2.00          2.02          1.98       74.7%
assign&free              5.53       3.10         2.44          2.55          2.56       16.6%
free                     4.80       2.66         1.51          1.56          1.51        8.7%
Weighted sum:            3.49       2.60         2.03          2.07          2.04
Improvement factor:      1.00       1.34         1.72          1.68          1.71

Table 2.7: Performance of the query functions when using the Iterative Modulo Scheduler (average microseconds per call).

Table 2.7 presents the average execution time for each machine representation and each function of the contention query module. To alleviate the impact of random perturbations, the data was derived from the measurement with the lowest total scheduling and resource modeling time over the entire 1327-loop benchmark and over 3 measurements on an idle workstation. Consider first the average execution times associated with the discrete representations. We notice that the trends are similar to the ones in Table 2.6 when using the reduced representation instead of the original representation, i.e. the execution time decreases moderately for the check-with-alt function and decreases significantly for the assign&free and free functions. As a result, the weighted average decreases by a factor of 1.34, from 3.49 to 2.60 microseconds per call. This factor, as expected, is smaller than the 1.63 factor obtained when accounting only for the work units, because of runtime overhead such as function calls, initialization of loop invariants, and other factors that are not included in the units-of-work metric. The average execution times associated with the bitvector representations also indicate a significant decrease for all three functions. Interestingly, the best performance

is obtained by the machine representation with 1 bitvector per word, decreasing the weighted average by a factor of 1.72, from 3.49 to 2.03 microseconds per call. In other words, the contention query module completes its work in 58.2% of the original representation time, on average, if the module uses the machine representation with 1 bitvector per word. Comparing the average execution times of the three bitvector representations in Table 2.7, we may notice that the average execution time of the check-with-alt function does not vary much, but is the fastest for the representation with 4 bitvectors per word and slowest for the one with 2 bitvectors per word. The representation with 4 bitvectors per word is the fastest because the decrease in average work units outweighs the overhead required to test and handle the alignment of the words encoding the bitvectors. Similarly, the representation with 2 bitvectors per word is the slowest because the decrease in average work units does not offset the code overhead. The behavior of the free function follows a similar pattern. The execution time of the assign&free function is different, however, because the function must test individual bits to determine which operation to unschedule in update mode, and the overhead apparently outweighs the benefit of packing two or more bitvectors per word for this function. We may conclude from Table 2.7 that the bitvector representations are consistently faster than the discrete representation. Also, when we may pack only a small number of bitvectors per word, the fastest representation is the one with 1 bitvector per word because of its lower code overhead. However, with more than 1 bitvector per word, the representation with the largest number of bitvectors per word appears to be the better choice, as the check-with-alt function, which is the most frequently called function, benefits from it. We have seen so far that, for the benchmark and scheduler considered in our experiment, reducing the machine description reduces the execution time of the contention query module to 58.2% of its original value. We now relate these results to the overall scheduling time, including the execution time of the scheduler as well as the contention query module.
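The weighting itself is elementary; for instance, the 3.49 microseconds per call reported for the original representation in Table 2.7 follows directly from the per-function times and the call frequencies in the rightmost column:

    # Weighted average for the original representation in Table 2.7.
    freqs  = {"check-with-alt": 0.747, "assign&free": 0.166, "free": 0.087}
    micros = {"check-with-alt": 2.88,  "assign&free": 5.53,  "free": 4.80}
    weighted = sum(freqs[f] * micros[f] for f in freqs)
    print(round(weighted, 2))   # 3.49 microseconds per call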

[Figure: bar chart of the cumulative time (seconds) spent in the free, assign&free, and check-with-alt functions for the original, discrete, and bitvector (1-cycle-word, 2-cycle-word, 4-cycle-word) representations.]

Figure 2.9: Performance of the contention query module when using the Iterative Modulo Scheduler.

[Figure: the same bar chart with the scheduler's own time added as a fourth component.]

Figure 2.10: Performance of the contention query module, including the Iterative Modulo Scheduler.

The cumulative time spent in the contention query module when scheduling the entire 1327-loop benchmark is shown in Figure 2.9 for each machine representation. The cumulative time spent in our implementation of the Iterative Modulo Scheduler (including the time to initialize the reserved table, compute the priority function, and schedule each operation with up to 6N scheduling decisions) is pictured in Figure 2.10. Consider the discrete representations: if the reduced machine description is used instead of the original machine description, the scheduling time is decreased by a factor of 1.18 for the entire benchmark, i.e. the scheduler completes in 84.7% of the time. If the representation with 1 bitvector per word is used instead, the scheduling time is decreased by a factor of 1.32, i.e. the scheduler completes in only 75.8% of the time. In the above measurements, the full functionality of the contention query module was used, as required by the Iterative Modulo Scheduler. We now investigate the performance of the contention query module for more conventional schedulers, which schedule an operation at a specified cycle only if the operation does not conflict with any operations currently in the partial schedule. Because of this restriction on the schedulers, we may use the assign function instead of the more expensive assign&free function. Once resources are committed to an operation, that operation is never unscheduled due to resource contentions, although it may still be unscheduled due to violated data dependences. To evaluate the performance of the contention query module in this context, we modified the Iterative Modulo Scheduler so that it only schedules an operation in a cycle in which the operation is known not to result in resource contentions. If no such cycle is found, the modulo scheduler simply increases the initiation interval by one and starts all over again. We refer to this scheduling algorithm as the Modified Iterative Modulo Scheduler. The characteristics of the generated schedules for the 1327-loop benchmark are summarized in Table 2.8. Because of the above restriction, the Modified Iterative Modulo Scheduler finds a schedule with minimum initiation interval in only 93.8% of the loops, instead of 95.5% of the loops with the unmodified scheduler.
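The outer loop of this modified scheduler can be sketched as follows (a hedged rendering; try_schedule_at is a hypothetical helper standing in for the conflict-free placement pass described above, returning a schedule or None):

    def modified_iterative_modulo_schedule(mii, max_ii, try_schedule_at):
        """Sketch of the Modified Iterative Modulo Scheduler's outer loop:
        operations are only placed in conflict-free cycles, so no operation
        is ever unscheduled for resource reasons; if some operation cannot
        be placed, II is increased by one and scheduling restarts."""
        for ii in range(mii, max_ii + 1):
            schedule = try_schedule_at(ii)
            if schedule is not None:
                return ii, schedule     # first II that fits
        return None, None               # no schedule found up to max_ii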

Measurements:                           min     freq   median  average      max
number of operations                   2.00     0.4%    10.00    17.54   161.00
initiation interval (II)               1.00    28.6%     3.00    11.58   165.00
II/MII                                 1.00    93.8%     1.00     1.01     1.50
schedule decisions / operation         0.23     0.1%     1.00     1.40     6.00
unscheduled ops / schedule decision    0.00    67.3%     0.00     0.45     5.00
  due to resource contention           0.00   100.0%     0.00     0.00     0.00
  due to violated data dependences     0.00    67.3%     0.00     0.45     5.00
alternatives / schedule decision       1.00    87.0%     1.00     1.13     2.00
check-with-alt / schedule decision     1.00    56.1%     1.00     2.65    98.00
check / schedule decision              1.00    53.5%     1.00     2.91    98.00
check / check-with-alt                 1.00    90.4%     1.00     1.10     2.00

Table 2.8: Characteristics of the 1327-loop benchmark suite using the Modified Iterative Modulo Scheduler.

Other noticeable differences with respect to the unmodified scheduler are that the modified scheduler generates fewer scheduling decisions per operation (an average of 1.40 instead of 1.52 decisions in row 4 of the table) and checks fewer cycles per scheduling decision (an average of 2.65 instead of 4.49 cycles in row 9 of the table). The performance of the contention query module is given in Tables 2.9 and 2.10 in average work units and average execution times, respectively. Note that since a different scheduling algorithm is used, the actual queries issued by the scheduler are different, and thus we may not directly compare the data from Tables 2.9 and 2.10 (obtained with the modified scheduler) to the data in Tables 2.6 and 2.7 (obtained with the unmodified scheduler). The first observation is that the cost of the assign function is significantly lower than the cost of the assign&free function, both in terms of average work units and average execution time. This effect is especially significant for the bitvector representations since, unlike assign&free, the assign function only iterates over memory words and does not incur any mode transition overhead. As a result, assign has a lower average work units per call than either check-with-alt or free, for each of the reduced machine representations shown in Table 2.9.
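For intuition, the restricted assign can be little more than OR-ing the reservation words into the reserved table; a sketch under the same assumed layout as the earlier check sketch (again our own naming, not the dissertation's code):

    def assign(reserved, reservation, row, II):
        """Sketch of the restricted assign: the caller has already verified
        that this row is conflict-free, so the words are simply OR-ed into
        the reserved table, with no update mode and no unscheduling."""
        for k, word in enumerate(reservation):
            reserved[(row + k) % II] |= word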

Representation:        original   discrete     bitvectors
                                                             (32 bits)     (64 bits)
Objective function        --      res-uses   1-bitvector-  2-bitvector-  4-bitvector-
minimizing:                                      uses          uses          uses       freq.
check-with-alt           3.04       1.87         1.66          1.27          1.17       64.8%
assign                   5.12       1.86         1.40          1.18          1.09       24.4%
free                     5.35       2.05         1.47          1.22          1.11       10.8%
Weighted sum:            3.80       1.89         1.57          1.24          1.15
Improvement factor:      1.00       2.01         2.42          3.06          3.30

Table 2.9: Performance of the query functions when using the Modified Iterative Modulo Scheduler (average work units per call).

Representation:        original   discrete     bitvectors
                                                             (32 bits)     (64 bits)
Objective function        --      res-uses   1-bitvector-  2-bitvector-  4-bitvector-
minimizing:                                      uses          uses          uses       freq.
check-with-alt           3.05       2.35         2.07          2.07          2.06       64.8%
assign                   4.91       2.78         2.09          2.12          2.10       24.4%
free                     4.44       2.53         1.43          1.52          1.49       10.8%
Weighted sum:            3.66       2.48         2.00          2.02          2.01
Improvement factor:      1.00       1.48         1.83          1.81          1.82

Table 2.10: Performance of the query functions when using the Modified Iterative Modulo Scheduler (average microseconds per call).

This results, in turn, in a lower weighted average: as little as 1.15 work units per call for the representation with 4 bitvectors per word, only 15% above the absolute minimum, since the contention query module must handle at least one unit of work per call in any finite-resource machine model. The second observation is that the improvement factors (comparing the performance of the reduced machine representations relative to the original machine representation) are higher when using the modified scheduler instead of the unmodified scheduler. For example, the improvement factor for the work units ranges from 2.01 to 3.30 in Table 2.9 versus 1.63 to 2.76 in Table 2.6. Similarly, the improvement factor for the execution times ranges from 1.48 to 1.83 in Table 2.10 versus 1.34 to 1.72 in Table 2.7.

[Figure: bar chart of the cumulative time (seconds) spent in the free, assign, and check-with-alt functions for the original, discrete, and bitvector (1-cycle-word, 2-cycle-word, 4-cycle-word) representations.]

Figure 2.11: Performance of the contention query module when using the Modified Iterative Modulo Scheduler.

[Figure: the same bar chart with the scheduler's own time added as a fourth component.]

Figure 2.12: Performance of the contention query module, including the Modified Iterative Modulo Scheduler.

Summarizing the performance of the contention query module for conventional schedulers that assign an operation to a given cycle only if the operation does not conflict with any operations currently in the partial schedule: we can decrease the weighted average work units by a factor of 3.30, from 3.80 to 1.15 units per call, by using the machine representation with 4 bitvectors per word, and we can decrease the weighted average execution time by a factor of 1.83, from 3.66 to 2.00 microseconds per call, by using the machine representation with 1 bitvector per word instead of the original machine representation. When including the execution time of the scheduler (see Figures 2.11 and 2.12), the total scheduling time is decreased by a factor of 1.35 for the entire benchmark, i.e. the scheduler completes in 74.1% of the time.

2.8 Summary

In this chapter, we have presented an efficient contention query module that supports the elaborate scheduling techniques used by today's high-performance compilers. In particular, we support unrestricted scheduling models, where the operation currently being scheduled may be placed before some already scheduled operations and backtracking is performed to produce highly optimized software-pipelined and critical-path-sensitive schedules. We also support precise boundary conditions where resource requirements may dangle from predecessor basic blocks to permit effective latency-hiding techniques. Our contention query module is based on a reduced machine description that results in significantly faster detection of resource contentions while exactly preserving the scheduling constraints present in the original machine description. This approach achieves three goals. First, it handles queries significantly faster, which is increasingly important as queries for contentions are issued within the innermost loop of a scheduler and their complexity increases with machine complexity. Second, it allows a hardware engineer to write an accurate structural description of a machine in terms that are close to the actual hardware structure without spending effort to simplify the machine description. Using this approach, a reduced machine description

is then generated for the convenience and speed of the compiler in an error-free and automated fashion. Third, this approach does not limit the functionality of the contention query module. Additional functionality that is fully supported here, and used in our experiments, includes scheduling an operation earlier than others that are already scheduled, unscheduling operations due to resource contentions, and efficiently handling the periodic resource requirements found in software pipelined schedules. Experiments with three ways of describing machines indicate that our approach addresses the perceived weakness of resource modeling approaches based on reservation tables. Because the machine descriptions are reduced, all resource contentions of one query are detected by a conservative (unweighted) average of 1.6 (MIPS R3000/R3010), 2.0 (DEC Alpha 21064), and 3.4 (Cydra 5) "and" operations when using the 64-bit-word bitvector representation. Moreover, the memory requirements needed to store the reserved resources of a schedule are small, as a single 64-bit word may encode the bitvectors of 4 (Cydra 5), 9 (MIPS R3000/R3010), or 12 (DEC Alpha 21064) schedule cycles. Dynamic measurements obtained when scheduling 1327 loops from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels for the Cydra 5 machine indicate that the weighted average work units (i.e. the essential work performed in the innermost loops of the contention queries) decrease by a factor of 2.76 for a scheduler using the full functionality of the contention query module and by a factor of 3.30 when using a subset of the module's functionality. In turn, this decrease in units of work results in a factor of 1.72 to 1.83 decrease, respectively, in the total execution time over all the contention queries. When the execution time of the scheduler itself is included (i.e. accounting for the time to initialize the reserved table, compute the priority function, and schedule each operation with up to 6N scheduling decisions per scheduling attempt), the scheduler completes in 75.8% and 74.1% of its original time, respectively. These improvements are obtained by using highly reduced machine descriptions instead of the original machine description. In our execution time measurements of scheduling the 1327-loop benchmark for the Cydra 5, the most effective representation is the one with 1 bitvector per word. Empirical evidence from the work units metric, however,

suggests that packing more bitvectors per word may result in a more effective representation once a threshold is crossed that amortizes the overhead incurred by this approach over a more significant reduction in work units.


CHAPTER 3

Overview of Modulo Scheduling

Current research compilers for VLIW and superscalar machines focus on exposing more of the inherent parallelism in an application to obtain higher performance by better utilizing wider-issue machines and reducing the schedule length of the code. There is generally insufficient parallelism within individual basic blocks, and higher levels of parallelism can be obtained by also exploiting the instruction level parallelism among successive basic blocks. Software pipelining is a technique that exploits the instruction level parallelism present among the iterations of a loop by overlapping the execution of consecutive loop iterations. With sufficient overlap, some machine resources can be fully utilized, resulting in a schedule with maximum steady-state throughput. Modulo scheduling is a software pipelining technique that results in high-performance code while requiring only modest compilation time by restricting the scheduling space [80]. Modulo scheduling uses the same schedule for each iteration of a loop and initiates successive iterations at a constant rate, i.e. one Initiation Interval (II clock cycles) apart. This restriction is known as the modulo scheduling constraint and forces a resource to be used by one iteration no more than once within each set of times that are congruent modulo II. The scope of modulo scheduling has been widened to a large variety of loops. Loops with conditional statements are handled using hierarchical reduction [55] or IF- and Reverse-IF-conversion [95]. Modulo scheduling has also been extended to

a large variety of loops with early exits, such as while loops [87][89]. Furthermore, the code expansion due to modulo scheduling can be eliminated by using special hardware, e.g. support for rotating register files and predicated execution [82]. Since modulo scheduling exploits a higher level of parallelism, it results in higher register requirements because more values are needed to support more concurrent operations. This effect is inherent to parallelism in execution and will be exacerbated by wider machines and higher-latency pipelines [64]. As a result, developing scheduling techniques that exploit instruction level parallelism while containing the register requirements is crucial to the performance of future machines. Our research aims at understanding the fundamental relationship between instruction level parallelism and register requirements for a set of benchmarks under realistic assumptions. To gain an understanding of this fundamental relationship, we explore optimal scheduling algorithms that result in a schedule with the highest steady-state throughput over all modulo schedules, and the minimum register requirements among such schedules. We also investigate fast register-sensitive heuristics that are directly applicable to production compilers. In this chapter, the concepts of the modulo scheduling approach are presented in Section 3.1. The register requirements associated with a modulo schedule are illustrated in Section 3.2, and an empirical illustration of the modulo scheduling solution space is provided in Section 3.3. We develop our algorithmic approach in Section 3.4 and formulate its input in Section 3.5. We conclude this chapter by presenting related work in Section 3.6.

3.1 Modulo Scheduling Approach

In this chapter, we illustrate the modulo scheduling approach based on a simple example for a hypothetical processor with three fully pipelined general-purpose functional units whose operation latencies are listed in Table 3.1. The code and machine were selected to obtain a concise example; however, the modulo scheduling approach and the scheduling algorithms developed in this and the following chapters are

general with respect to the code, the numbers and types of functional units, resource reservation tables, and operation latencies.

Functional Units:  3 general-purpose functional units
Operations:   load/store   add/sub   mult   div   sqrt
Latencies:         1           1       4      8     10

Table 3.1: Example target machine (hypothetical processor).

Example 3.1  This simple example illustrates the key concepts of the modulo scheduling approach.

The loop kernel is z[i] = 2*x[i] + y[i]: the x[i] and y[i] values are read from memory, x[i] is multiplied by 2, incremented by y[i], and the result is stored in z[i], as shown in the dependence graph of Figure 3.1a. The vertices of the dependence graph correspond to operations and the edges correspond to data in registers flowing from one operation to another. Figure 3.1b presents a valid schedule for one iteration of this simple kernel.

Modulo scheduling exploits the instruction level parallelism present among the iterations of a loop by overlapping the execution of consecutive iterations [80]. Modulo scheduling restricts the space of software pipelined schedules by requiring every loop iteration to use the same schedule and by initiating consecutive loop iterations at a constant rate, i.e. one Initiation Interval (II clock cycles) apart. An execution trace produced by the modulo schedule of Example 3.1 is shown in Figure 3.1c. The vertical axis represents the time in cycles and the horizontal axis represents the iteration space of the loop. In this execution trace, the initiation interval is 2, as a new iteration is initiated every two clock cycles. The stream of operations actually executed by the target processor is shown in Figure 3.1d. This graph is obtained by collapsing Figure 3.1c along a vertical axis. We distinguish three distinct phases in this stream of operations. The first phase, labeled prologue, includes times 0 to 5 and corresponds to the filling of the software pipeline as the first 3 iterations begin execution. The second phase, labeled steady-state, includes times 6 to 7 and corresponds to the steady state of the software pipeline.

[Figure: four panels illustrating Example 3.1: (a) the dependence graph over the operations lx, ly, *, +, and st; (b) the schedule and terminology for one iteration, with the operations spread over times 0 through 7, grouped into stages 0 through 3 of II cycles each, and the Modulo Reservation Table (MRT) alongside; (c) the execution trace, with time (cycles 0 through 12) on the vertical axis, iterations 0 through 3 on the horizontal axis, and a new iteration initiated every II = 2 cycles; (d) the stream of operations obtained by collapsing the trace, partitioned into the prologue, steady-state, and epilogue phases.]

Figure 3.1: Key concepts and terminology of modulo scheduling (Example 3.1).

Each additional loop iteration beyond iteration 3 in the execution trace would result in one additional instance of the cycle pair 6 and 7 in the steady-state phase with exactly the same operation schedule. The third phase, labeled epilogue, includes times 8 to 12 and corresponds to the draining of the software pipeline as the last 3 iterations complete. The performance of a modulo schedule is a function of the initiation interval (II) and the schedule length (SL) of one iteration, where the former impacts the steady-state performance and the latter affects the transient performance of a modulo schedule. More precisely, the total number of cycles required to execute a modulo schedule is [77][78]:

    SL + II \cdot (n - 1)                                    (3.1)

where n is the number of times the loop is executed. The schedule length corresponds to the number of cycles between the earliest issue time and the latest completion time among the operations of one loop iteration, including the branch operation and its delay slots, if any.
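As a quick illustration of Equation (3.1), under our reading of Figure 3.1 (SL = 8 cycles for the schedule of Example 3.1, with II = 2): executing the loop n = 100 times takes 8 + 2·(100 − 1) = 206 cycles. For large n the II term dominates, which is why the initiation interval governs steady-state performance while the schedule length only affects the transient.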

The resource requirements of a modulo schedule are governed by the resources consumed in the steady-state phase, since it corresponds to the phase with the maximum number of concurrent loop iterations. These resource requirements are best summarized by using a Modulo Reservation Table (MRT) [48][73], which contains II rows, one per clock cycle, and one column for each resource at which conflicts may occur. An MRT may be obtained by recording the resource requirements in any II consecutive cycles of the steady-state portion of an execution trace, e.g. times 6 and 7 in Figure 3.1d. Note that the resource requirements of a modulo schedule can be quickly computed from the schedule of a single loop iteration because of the constraints on the scheduling space imposed by the modulo scheduling approach, i.e. using the same schedule template for each loop iteration and initiating successive iterations at a constant rate. The resource requirements are simply obtained by collapsing the schedule for one iteration to a table of II rows using wraparound. The MRT associated with the execution trace is shown in Figure 3.1b; it satisfies the resource constraints of our target machine since no more than 3 operations of any kind are issued in each row of the MRT. The MRT can also model specialized resources by allowing only certain types of operations in each column. The schedule of an iteration can be divided into stages of II cycles each. The number of stages in an iteration is referred to as the stage count (SC) [82] and corresponds to the maximum number of concurrent iterations in a modulo schedule. Figure 3.1b depicts the 4 stages associated with the schedule of Example 3.1. We notice that indeed the number of concurrent iterations never exceeds 4 in the execution trace depicted in Figure 3.1c. In general, the stage count is computed from the schedule length (SL) of one loop iteration: SC = ⌈SL/II⌉. For simplicity, in our examples, we compute the stage count using the schedule length SL′, defined as the schedule length from the earliest issue time to the latest issue time of the operations in one loop iteration. Schedule lengths reported in our measurements, however, are SL, not SL′. As steady-state throughput is simply the reciprocal of II, choosing the minimum feasible II achieves the highest possible steady-state performance. The initiation interval is bounded below by the minimum initiation interval (MII) [80], which is a lower bound on the smallest feasible value of II for which a modulo schedule can be found. This lower bound is constrained either by critical resources being fully utilized or by critical loop-carried dependence cycles. In general, MII may be noninteger. However, since the initiation interval of a modulo schedule must be some integer number of cycles, an integer-valued MII can be obtained simply by rounding up to the next larger integer value, by unrolling the loop some number of times, or by a combination of both [78]. The resource-constrained lower bound, ResMII, is determined by accounting for all the resources consumed by each of the operations of an iteration. For example, if a resource is used x times on a machine with q identical copies of that resource, clearly II cannot be smaller than x/q because of the modulo scheduling constraint.

Determining the ResMII lower bound can be difficult, in general, when an operation can be mapped to different opcodes with distinct resource usages [78]. The recurrence-constrained lower bound, RecMII, is determined by accounting for all the elementary cycles in the dependence graph that are created by recurrences, i.e. loop-carried dependences. For example, if the result of an operation is used by another operation in a subsequent loop iteration, the scheduler must ensure that the result of the first operation is available to the second operation. The RecMII lower bound eliminates the initiation intervals for which we can guarantee that there are no schedules in which every dependence is fulfilled. For example, if the sum of the latencies along a cycle is l and the sum of the dependence distances along the same cycle is d, then II can be no smaller than l/d. Calculating the minimum RecMII becomes more complicated when several recurrence cycles have conflicting requirements [24]. Techniques to compute RecMII are based on enumeration [27], minimization by a linear programming problem [29], or the minimal cost-to-time ratio cycle problem [49][78]. Relating ResMII and RecMII to the available instruction level parallelism of a loop for a given machine, we can state that a loop has insufficient instruction level parallelism for a given machine when RecMII, rather than ResMII, constrains MII. Or conversely, we may say that the machine has excessive resource parallelism for the loop. Patel has shown that, for acyclic dependence graphs, at least one functional unit can always be fully saturated by adding delays to a task schedule [73]. If we view a single loop iteration as a task, such delays might have to be applied between operations as well as between the successive pipeline stages used by a single operation. Since in our problem formulation the flow of a single operation cannot be altered, and furthermore since we need to deal with cyclic graphs as well, the determination of a tight minimum initiation interval is more complex. Consequently, MII as computed by any of the techniques above may not be feasible, even when loop unrolling is allowed. Note that a higher steady-state throughput may sometimes be obtained by unrolling the loop before the modulo scheduling phase. In Example 3.1, the schedule

shown in Figure 3.1b achieves ⌈MII⌉ = ⌈max(ResMII, RecMII)⌉ = ⌈max(5/3, 0)⌉ = 2, since it fully utilizes the three general-purpose functional units of the target machine in the first row of the MRT. We see, however, that one functional unit is not used in the second row of the MRT; thus, if the loop iteration were unrolled six times, one could initiate a new iteration (6 original loop iterations) every 10 cycles instead of 1 original loop iteration every 2 cycles. By unrolling the loop, we would thus achieve a throughput of 0.6 iterations per cycle instead of 0.5. A similar benefit may occur in the presence of a fractional RecMII. In this thesis, we focus on finding the best schedule given a particular degree of unrolling and will not further pursue the unrolling of loop iterations to improve the steady-state performance of the schedule. Thus, we consider a schedule to have optimal throughput when it achieves the maximum feasible throughput for the given dependence graph, machine resources, and unrolling factor.
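A minimal sketch of the two lower bounds just described (our own code; enumerating the elementary dependence cycles that feed rec_mii is the hard part, as noted above, and is elided here):

    from math import ceil

    def res_mii(fu_usage):
        """ResMII for fully pipelined FUs: x uses of a resource with q
        identical copies force II >= x/q. fu_usage: list of (x, q)."""
        return max(x / q for x, q in fu_usage)

    def rec_mii(cycles):
        """RecMII: a dependence cycle of total latency l and total
        distance d forces II >= l/d. cycles: list of (l, d)."""
        return max((l / d for l, d in cycles), default=0.0)

    # Example 3.1: 5 operations on 3 general-purpose FUs, no recurrences.
    print(ceil(max(res_mii([(5, 3)]), rec_mii([]))))   # 2, as in the text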

3.2 Register Requirements of a Modulo Schedule

In this dissertation, we assume that each virtual register is defined by a unique operation and that once its value has been defined, it may be used by several operations. We also assume that a virtual register is reserved in the cycle where its define operation is scheduled, and remains reserved until it becomes free in the cycle following its last-use operation. Thus in our machine model, the additional reserved time beyond the beginning of the last-use operation cycle is one clock cycle. Although the machine model has an impact on the formulation of the register requirements, the technique presented in this chapter can be directly adapted to other models as well. Other machines, for example, may have an additional reserved time which is 0, 2 or more, or operation dependent. The lifetime of a virtual register is the set of cycles during which it is reserved. Consider the loop iteration of Example 3.1. The virtual register lifetimes associated with the schedule of Figure 3.2a are presented in Figure 3.2b.

[Figure: four panels for Example 3.1: (a) the schedule of one iteration (II = 2); (b) the lifetimes of vr0 through vr3, distinguishing the latency portion of each lifetime from the additional lifetime; (c) the MRT; (d) the register requirements, with cycles 0,2,4,6 and 1,3,5,7 wrapped onto the two MRT rows and each lifetime split into its integral and fractional parts.]

Figure 3.2: Register requirements of a modulo schedule (Example 3.1).

The black bars in Figure 3.2b correspond to the initial portion of the lifetime, which minimally satisfies the latency of the defining functional unit; the grey bars show the additional lifetime of each virtual register, through the end of the first cycle of its last-use operation. Ideally, the additional lifetimes should not be longer than one cycle; however, because of scheduling constraints, it is not always possible to schedule each use operation immediately after its input operand latencies expire. In this dissertation, we quantify the register requirements of a modulo schedule by computing MaxLive, the maximum number of live values in any single cycle of the loop schedule [82]. MaxLive corresponds to the minimum number of registers required to generate spill-free code. MaxLive is a tight bound, i.e. we can guarantee that there is a feasible register allocation with exactly MaxLive registers, as indicated by the following theorem.

Theorem 3.1  We can find a register allocation for a modulo schedule that requires no more than MaxLive registers. The resulting register allocation may require unrolling the modulo schedule some finite number of times, using modulo variable expansion [55].


Proof. We demonstrate the theorem by considering the trace of operations, t, that is generated when executing the given modulo schedule an infinite number of times. From the definition of MaxLive, we know that there are no more than MaxLive live values in any cycle of the operation trace t. Finding a register allocation for t can be accomplished by coloring the interval graph associated with t. Olariu [71] has shown for interval graphs that a linear-time greedy heuristic is guaranteed to find a minimum coloring, namely a coloring with MaxLive colors. Moreover, since the operation trace t consists of the operations produced by repetitively executing the same loop schedule, the coloring of t can be examined in every II-th cycle once steady-state operation has been reached. Since a finite number of colors is used, the coloring of t in one of these cycles must eventually be identical to the coloring k·II cycles earlier, for some integer k. After this point, the coloring of these k·II cycles can be repeated indefinitely, and the register assignment that corresponds to this coloring can be used in the software-pipelined code if the II cycles of the steady-state portion of the code are unrolled k times. This register allocation uses precisely MaxLive registers. □

Note that the degree of unrolling required in Theorem 3.1 may be impractical. However, in practice, Rau et al. [82] have shown that register allocation algorithms for modulo schedules can achieve register allocations that are within one register of the MaxLive lower bound for the vast majority of their modulo scheduled loops, without loop unrolling, on machines with rotating register files and predicated execution to support modulo scheduling. When such hardware support is not available, they have shown that their register allocation algorithm typically achieves register allocations that are within four registers of the MaxLive lower bound by using the minimum degree of loop unrolling (because a new iteration is initiated every II cycles, the minimum degree of unrolling, u, is determined by the longest lifetime, i.e. u = ⌈l/II⌉, where l is the length in cycles of the longest lifetime among the loop-variant lifetimes [82]).

66

steady-state register requirements can be quickly computed by wrapping the lifetimes for one iteration around a vector of length II . In particular, Figure 3.2d shows the number of live registers in steady-state loop execution and is constructed either by collapsing Figure 3.2b to II rows with wraparound or by replicating Figure 3.2b with a shift of II cycles between successive copies until the pattern in II successive rows repeats inde nitely. Figure 3.2d then displays these II steady-state rows in a compact form. In Figure 3.2d, we see that exactly seven virtual registers are live in the rst row and eight in the second, resulting in a MaxLive of eight. Figure 3.2d employs a di erent shading to di erentiate between two distinct contributions to each virtual register lifetime: the fractional and the integral part. For virtual registers with several uses, only the last use is considered. In our machine model, the additional reserved time is one clock cycle and is included, for simplicity, in the fractional part. The fractional part thus spans the rows in the MRT inclusively from the def operation row forward with wraparound through the last-use operation row. Its length thus ranges from 1 to II and is equal to 1 plus the modulo II distance between these two rows, as shown in Figure 3.2d. The integral part spans the entire MRT exactly s times, where s corresponds to the number of times that the row number of the last-use operation appears in the time interval starting inclusively from the def operation schedule time until, but not including, the last-use operation schedule time. The integral MaxLive is de ned as the number of integral part bars in a row, summed over all virtual registers. The fractional MaxLive is the number of fractional part bars in the row with the most fractional part bars. The total MaxLive is the sum of the fractional and integral MaxLive. We di erentiate here between integral MaxLive and fractional MaxLive because this distinction eases the formulation of the modulo scheduling problems that minimize total MaxLive which are presented in Chapters 4 and 5. Prior to our work, modulo scheduling algorithms based on linear programming or integer linear programming formulations have not directly minimized MaxLive, but have instead minimized approximations of the register requirements. For example, 67

[Figure: three panels for the schedule of Example 3.1, each showing vr0 through vr3 wrapped onto MRT rows 0,2,4,6 and 1,3,5,7: (a) register requirements, with each lifetime split into integral and fractional parts, giving MaxLive = 8; (b) buffer requirements, with each lifetime padded to a multiple of II, giving Buff = integral MaxLive + N = 8; (c) average lifetime requirements, giving an average lifetime of 8.]

Figure 3.3: Register vs. buffer requirements (Example 3.1).

For example, the work of Govindarajan et al. [44] and Wang et al. [93] approximates registers by conceptual FIFO buffers, initially proposed by Ning and Gao [70]. Unlike registers, conceptual FIFO buffers are reserved for an interval of time that is always an integer multiple of II cycles. Figure 3.3 illustrates the difference between the steady-state register and buffer requirements for the modulo schedule of Example 3.1. Notice in Figure 3.3b the additional part needed to complete the 3·II interval associated with the vr1 buffer. As shown in Figure 3.3b, the buffer requirement associated with a virtual register always corresponds to its integral part plus 1. Thus, the buffer requirement of a schedule, referred to as Buff, is simply integral MaxLive + N, where N is the number of operations with output operands. As a result, minimizing Buff is equivalent to minimizing integral MaxLive. Another approximation of the register requirements found in the literature, used by Ning and Gao [70] and Dupont de Dinechin [29], approximates MaxLive by computing the average lifetime requirements, referred to as AvgLive, which corresponds to the sum of all the register lifetimes divided by II. This approximation can be visualized by shifting the lifetimes so that they run contiguously with wraparound

over the II rows of the MRT, as shown in Figure 3.3c. Note that MaxLive, Buff, and AvgLive are all equal to 8 in this simple example. However, applying these same constructions to the schedule for Example 3.2 in Figure 3.5 results in MaxLive = Buff = 6 and AvgLive = 5. Similarly, Figure 3.6 for Example 3.3 yields MaxLive = AvgLive = 5 and Buff = 6. Thus, these metrics can result in quantitative differences for a given schedule. Only the minimum-MaxLive objective function is guaranteed to minimize the actual register requirements. Experimental evidence presented in Chapters 4 and 5 indicates the benefit of precisely formulating the register requirements by using the minimum-MaxLive objective function.
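Under the machine model of this section (a register is reserved from its def cycle through the end of its last-use cycle), all three metrics follow directly from the lifetimes; a hedged sketch, in which the buffer count is taken as each lifetime rounded up to a whole number of II-cycle slots:

    from math import ceil

    def register_metrics(lifetimes, II):
        """Compute MaxLive, Buff, and AvgLive for a modulo schedule.
        lifetimes: list of (def_time, last_use_time) per virtual register."""
        live = [0] * II                    # live values per steady-state row
        total_cycles = 0
        buffers = 0
        for t_def, t_use in lifetimes:
            length = t_use - t_def + 1     # reserved through the last-use cycle
            total_cycles += length
            buffers += ceil(length / II)   # FIFO buffers span multiples of II
            for c in range(t_def, t_use + 1):
                live[c % II] += 1          # wrap the lifetime around II rows
        return max(live), buffers, total_cycles / II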

3.3 Scheduling Complexity

To illustrate the size of the modulo scheduling space, and thus provide an indication of the computational complexity of finding a schedule with maximum throughput over all modulo schedules and minimum register requirements among such schedules, this section derives the number of distinct MRTs that can be generated for a given loop on a given machine, focusing on machines with fully pipelined functional units that may be either general-purpose or specialized. Let us first consider the number of distinct MRTs of II rows that can be generated when scheduling n operations on a machine with m fully pipelined general-purpose functional units. Since the machine has m fully pipelined functional units, each row of the MRT has between 0 and m operations; we thus consider a partition of the II rows of the MRT into components of size i_0, i_1, i_2, ..., i_m, where i_x is the number of MRT rows with x operations. A partition is valid if the sum of the rows in the partition is equal to II (i.e. i_0 + i_1 + i_2 + ... + i_m = II) and if the total number of operations over all rows is equal to n (i.e. 0·i_0 + 1·i_1 + 2·i_2 + ... + m·i_m = n). Given a valid partition i_0, i_1, i_2, ..., i_m, the number of row permutations is defined by the multiset permutation formula [16]:

    \frac{II!}{i_0! \, i_1! \cdots i_m!}                     (3.2)

and corresponds to the number of distinct permutations of the MRT rows that can be obtained with the given partition, where two rows with the same number of operations are not considered distinct. Equation (3.2), however, does not take into account the permutations of the operations among these rows. For a given partition and row permutation, the number of permutations of the n operations is n! if all permutations of the operations are taken into account. If we do not consider the permutations of operations within the same MRT row to be distinct, we obtain the following number of distinct operation permutations:

    \frac{n!}{(0!)^{i_0} (1!)^{i_1} (2!)^{i_2} \cdots (m!)^{i_m}} = \frac{n!}{(2!)^{i_2} \cdots (m!)^{i_m}}        (3.3)

where, for example, (2!)^{i_2} eliminates the operation permutations for the operation pair in each of the i_2 MRT rows that have exactly two operations. Thus, the total number of distinct permutations for a given row partition is obtained by multiplying Equations (3.2) and (3.3). By summing the total number of permutations over all valid partitions, we obtain P(II, n, m), the number of distinct permutations for an MRT of II rows generated by scheduling n operations on a machine with m fully pipelined functional units:

    P(II, n, m) = \sum_{\substack{\sum_{x=0}^{m} i_x = II \\ \sum_{x=0}^{m} x \, i_x = n}} \frac{II!}{i_0! \, i_1! \cdots i_m!} \cdot \frac{n!}{(2!)^{i_2} \cdots (m!)^{i_m}}        (3.4)

Finally, for a machine with general-purpose functional units, we obtain the number of distinct MRTs by dividing P(II, n, m) by II, since we consider the II rotations of a single MRT as equivalent, namely:

    G_{MRT} = \frac{P(II, n, m)}{II}                         (3.5)

For machines with specialized functional units and no alternative operations (i.e. each operation can be executed by exactly one functional unit), the number of distinct MRTs is obtained as follows:

    S_{MRT} = \frac{1}{II} \prod_{f \in F} P(II, n_f, m_f)   (3.6)

where F is the set of all functional units, n_f is the number of operations in the loop that use functional unit f, and m_f is the number of fully pipelined functional units of type f. We now illustrate the number of distinct MRTs for the innermost loop of a simple kernel of the Livermore Fortran Kernels:

          DO 7 l = 1, Loop
          DO 7 k = 1, n
            X(k) = U(k) + R * (z(k) + R*Y(k))
         .       + T * (U(k+3) + R * (U(K+2) + R*U(k+1)))
         .       + T * (U(k+6) + R * (U(K+5) + R*U(k+4)))
        7 CONTINUE

Figure 3.4: Kernel 7 of the Livermore Fortran Kernels.

The innermost loop of the kernel shown in Figure 3.4, after eliminating redundant load operations across consecutive loop iterations, can be coded with 4 load/store operations, 8 add operations, and 7 mult operations. Since there are no loop-carried dependences in the innermost loop iteration, RecMII = 0 and the minimum initiation interval is uniquely determined by the resource constraints. Consider first a machine with general-purpose functional units. The numbers of distinct MRTs for this kernel (n = 19) with various numbers of general-purpose functional units (m = 3, 6, 8, 10, 19) are computed using Equation (3.5) and listed in Table 3.2. We can see that there are already more than 24 billion distinct MRTs for an MRT of 4 rows, a kernel of 19 operations, and a general-purpose functional unit machine issuing up to 6 operations per cycle. Similarly, the number of distinct MRTs for machines with specialized functional units can be computed by Equation (3.6), as shown in Table 3.3.

FUs   II   number of MRTs (G_MRT)
  3    7   14,340,021,696,000
  6    4   24,280,264,008
  8    3   218,658,726
 10    2   92,378
 19    1   1

Table 3.2: Number of MRTs for 19 operations with general-purpose functional units.

load/store FUs   add FUs   mult FUs   II   number of MRTs (S_MRT)
      1             1          1       8   341,397,504,000
      2             2          2       4   323,870,400
      2             3          3       3   31,752,000
      2             4          4       2   14,700
      4             8          7       1   1

Table 3.3: Number of MRTs for 19 operations with specialized functional units.

These large combinatorial numbers are a practical indication of the size of the modulo scheduling search space. They are a serious concern because the problem of selecting a minimum-cost MRT has been shown to be NP-complete [48][55].
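Equations (3.4) and (3.5) are straightforward to evaluate by enumerating the valid partitions; the following sketch (our own code) reproduces two of the Table 3.2 entries as a sanity check:

    from math import factorial

    def P(II, n, m):
        """Equation (3.4): enumerate the partitions (i_0, ..., i_m) of the
        II rows, where i_x rows hold exactly x operations, and sum the row
        permutations times the operation permutations."""
        total = 0

        def rec(x, rows_left, ops_left, row_denom, op_denom):
            nonlocal total
            if x == 0:                     # the remaining rows stay empty
                if ops_left == 0:
                    total += (factorial(II) // (row_denom * factorial(rows_left))) \
                           * (factorial(n) // op_denom)
                return
            i = 0
            while i <= rows_left and x * i <= ops_left:
                rec(x - 1, rows_left - i, ops_left - x * i,
                    row_denom * factorial(i), op_denom * factorial(x) ** i)
                i += 1

        rec(m, II, n, 1, 1)
        return total

    def G_MRT(II, n, m):
        """Equation (3.5): the II rotations of an MRT are equivalent."""
        return P(II, n, m) // II

    assert G_MRT(2, 19, 10) == 92378      # Table 3.2, 10 FUs
    assert G_MRT(1, 19, 19) == 1          # Table 3.2, 19 FUs
    print(G_MRT(4, 19, 6))                # expected: 24,280,264,008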

3.4 Algorithmic Approach

In our work, in an effort to reduce the extreme complexity evidenced in Section 3.3, we view modulo scheduling as a three-step procedure with different objectives for each step. Some algorithms treat each step separately; e.g. Eisenbeis's software pipelining approach [40] uses distinct heuristics for Steps 1, 2, and 3 below. Other algorithms combine steps to optimize some combination of their distinct objectives; e.g. Huff's

lifetime-sensitive modulo scheduler [49] or Rau's Iterative Modulo Scheduler [78] combines Steps 1 and 2 below.

1. MRT-scheduling primarily addresses resource constraints and is best implemented by using a modulo reservation table (MRT) [48][73], which contains II rows, one per clock cycle, and one column for each resource at which conflicts may occur. Filling the MRT consists of packing the operations of one iteration within the II rows of the MRT to obtain a schedule with no resource conflicts. As steady-state throughput is simply the reciprocal of II, choosing the minimum feasible II achieves the highest possible steady-state performance.

2. Stage-scheduling primarily addresses dependence constraints, which specify that the distance of each pair of dependent operations is no less than the latency for such a pair. These constraints are satisfied by delaying each operation in the MRT by some (integer) multiple of II and associating these delayed operations with a single stage of the loop. Note that there may not be any feasible stage schedule for a given MRT-schedule because of some critical recurrence cycles in the dependence graph. In this case, Step 1 must generate a new MRT-schedule, possibly with a larger initiation interval.

3. Register allocation performs the actual allocation of virtual registers to physical registers, a scheme that may vary depending on the hardware support available, e.g. rotating register files and support for predicated operations, and the desired, or maximum permitted, level of loop unrolling and loop peeling [82][55][38].

In Chapter 4 we focus on register-sensitive stage-scheduling algorithms that perform Step 2 in isolation, for which we investigate both optimal algorithms and efficient heuristics. The first advantage of the stage-scheduling approach is that its scheduling space is significantly reduced, relative to the combination of MRT-scheduling and stage-scheduling, thus permitting a search for the stage schedules with minimum register requirements among all stage schedules for a given MRT. This approach, although suboptimal, works well for very large loops, e.g. loops with more than 150

operations. In comparison, the more general modulo scheduling algorithms handling Steps 1 and 2 jointly may ultimately find better solutions, but can only handle medium-sized loops (e.g. loops with up to 41 operations and initiation intervals up to 118 cycles) in a reasonable amount of computation time (e.g. up to 15 minutes per loop). Second, we will show that fast and effective stage-scheduling heuristics can be developed, as these heuristics can focus solely on reducing the register requirements for a given MRT. This contrasts with more general modulo scheduling heuristics [49][57][58] that reduce the register requirements while simultaneously considering potentially conflicting objectives, such as satisfying the resource constraints, scheduling operations well along critical dependence cycles, maximizing the throughput of the modulo schedule, and minimizing the schedule length of the modulo schedule. In Chapter 5 we focus on scheduling algorithms that perform Steps 1 and 2 jointly, for which we investigate an optimal algorithm that results in a schedule with the highest steady-state throughput over all modulo schedules, and the minimum register requirements among all maximum steady-state throughput schedules. The advantage of this approach is that it may result in lower register requirements, as it can explore the full space of MRT schedules and their stage schedules. In this dissertation, we do not explore new techniques for Step 3, nor do we integrate Step 3 with the two previous steps. Step 3 is easily separated from Steps 1 and 2 without any significant penalty. A substantial separate body of research has investigated this issue [82][46][39][6], and the results of that work can be applied here as well.

3.5 Dependence Graph and Transformations

We represent a loop by a dependence graph G = {V, E_sched, E_reg}, where the set of vertices V represents operations and the sets of edges E_sched and E_reg correspond, respectively, to the scheduling dependences and the register dependences among operations. A scheduling edge enforces a temporal relationship between a pair of dependent operations or between a pair that cannot be freely reordered, such as

load and store operations to ambiguous memory locations. A scheduling edge from operation i to operation j, w iterations later, is associated with a latency l_{i,j} and a loop-carried dependence distance ω_{i,j} = w. A register edge corresponds to a data flow dependence carried in a register. In general, there are scheduling edges that have no corresponding register edge; e.g. the dependence between a store and a load to an ambiguous (or the same) memory location results in a scheduling edge but not in a register edge. However, it is generally assumed that each register edge is also a scheduling edge. In this section, however, we illustrate the conditions under which redundant edges in the dependence graph can be removed from either the scheduling or register edge set without modifying the scheduling space of register-sensitive modulo schedulers. Removing edges has several benefits and is extensively used in Chapters 4 and 5. Using this technique, for example, the optimal stage-scheduler investigated in Chapter 4 completes in 53.5% of its original time, over our benchmark suite of 1327 loops, when redundant edges are eliminated. Also, redundant edges clutter the dependence graph, thus preventing efficient register-sensitive scheduling heuristics from detecting situations in which increased register requirements can be simply eliminated. Typically, it is known that transitive scheduling edges of a single basic block can be ignored when scheduling that basic block. In this section, we extend this result in two directions. First, we handle dependence graphs with arbitrary dependence distances. Second, we remove redundant scheduling edges and redundant register edges. Removing redundant register edges is particularly important since they significantly impact the run times of register-sensitive schedulers and can confuse heuristics that may be used. Formally, we remove redundant edges of the original graph G = {V, E_sched, E_reg} to get a new graph G′ = {V, E′_sched, E′_reg} with fewer redundant edges. Edges are removed from G if and only if we can prove that scheduling for G or G′ is guaranteed to result in defining the same scheduling search space with the same register requirements. While E_reg was defined as a subset of E_sched, this inclusion property does not hold in G′, as some edges may be removed from E_sched to form

E′_sched without removing the corresponding edges, if any, from E_reg to form E′_reg.

[Figure: three panels for Example 3.2: (a) the dependence graph (ld, +, *, st, with virtual registers vr0, vr1, and vr2); (b) a schedule of one iteration with II = 2; (c) the corresponding lifetimes of vr0, vr1, and vr2.]

Figure 3.5: Modulo schedule for Example 3.2.

Example 3.2  This example illustrates a case where two redundant edges can safely be removed from G′. The kernel is y[i] = x[i] * (x[i] + a), as shown in the dependence graph of Figure 3.5a. This example is a basic block with no loop-carried dependence. In this example, the mult operation is guaranteed to be scheduled no earlier than one cycle after the add operation. As a result, the scheduling edge (load, mult) can be safely removed from E′_sched, because enforcing the dependence constraints of scheduling edges (load, add) and (add, mult) is necessarily more constraining than enforcing the dependence constraint of scheduling edge (load, mult). Similarly, the register edge (load, add) can be removed from E′_reg, since the mult is the last-use operation of vr0, which implies that the register lifetime generated by the (load, add) edge is necessarily included in the lifetime generated by the register edge (load, mult).

Example 3.3  This example shows that the notion of transitive edges cannot be extended directly to dependence graphs with loop-carried dependences. The kernel is y[i] = x[i] * (x[i-1] + a), as shown in the data flow dependence graph of Figure 3.6a. This example differs from Example 3.2 only by the dependence of distance 1 associated with the edge (load, add); ω = 1 is indicated by the "[1]" label on this edge in Figure 3.6a (edges without such labels have ω = 0).

[Figure: three panels for Example 3.3: (a) the dependence graph, with a distance-1 edge labeled [1] from ld to +; (b) a schedule of one iteration with II = 2; (c) the corresponding lifetimes of vr0, vr1, and vr2.]

Figure 3.6: Modulo schedule for Example 3.3.

Figure 3.6b presents a schedule for this kernel where the add operation is scheduled at time 0, a valid schedule since the add operation's input is generated by the load operation of the previous iteration. Figure 3.6c illustrates the lifetimes associated with this schedule. Since the result of the load operation is used both by the mult operation (time 1) and by the add operation of the next iteration (time 0 + II = 2), the lifetime of vr0 lasts through time 2. This schedule shows that the lifetime associated with the register edge (load, add) is not necessarily included in the lifetime associated with the register edge (load, mult), even in the presence of a scheduling edge from the add operation to the mult operation. As a result, the register edge (load, add) cannot be removed from G′.

To determine the conditions under which edges can be safely removed, we introduce the MinDist relation as presented by Huff [49]. MinDist(x, y) represents the minimum number of cycles, possibly negative, that must elapse from the time that operation X is scheduled until the time that operation Y is scheduled, if there is a path in the dependence graph from X to Y. Otherwise, MinDist(x, y) is defined as minus infinity. Computing MinDist is an all-pairs longest-path problem where each dependence edge (i, j) is assigned a length of l_{i,j} − ω_{i,j}·II, the minimum schedule distance from operation i to operation j, ω_{i,j} iterations later. MinDist is guaranteed to converge to a solution if the sum of the lengths along each cycle in the dependence graph is nonpositive [49]. Otherwise MinDist(x, y) would have an infinite value for each operation pair, x and y, that is connected by a path including a cycle with strictly positive length, since the longest path from x to y would include this cyclic path

with strictly positive length an infinite number of times. Fortunately, the minimum initiation interval due to recurrences (RecMII) precludes this situation: II ≥ RecMII guarantees that the sum of the l_{i,j} − ω_{i,j}·II lengths along each cycle is nonpositive, as otherwise the scheduling constraints of such a cycle could not be simultaneously satisfied [49].
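Computing MinDist for a candidate II is thus a standard all-pairs longest-path computation; a minimal sketch (our own data layout), using a Floyd-Warshall-style relaxation that is valid precisely because II ≥ RecMII rules out positive cycles:

    def min_dist(n_ops, edges, II):
        """All-pairs longest-path MinDist for a given II.
        edges: list of (i, j, latency, omega) scheduling edges."""
        NEG = float('-inf')
        d = [[NEG] * n_ops for _ in range(n_ops)]
        for i, j, lat, omega in edges:
            d[i][j] = max(d[i][j], lat - omega * II)
        for k in range(n_ops):             # relax, maximizing path length
            for i in range(n_ops):
                if d[i][k] == NEG:
                    continue
                for j in range(n_ops):
                    if d[k][j] != NEG and d[i][k] + d[k][j] > d[i][j]:
                        d[i][j] = d[i][k] + d[k][j]
        return d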

Theorem 3.2 (Edge Removal Test)  Consider the operations A, B, and C in V with edges (a, b) and (a, c) in E_sched and E_reg of G. When the following inequality holds:

    (\omega_{a,c} - \omega_{a,b}) \cdot II + MinDist(b, c) \ge 0

we can guarantee that if the edge (a, b) is a scheduling edge in graph G′, the edge (a, c) can be removed from the scheduling edge set of graph G′. Similarly, we can guarantee that if the edge (a, c) is a register edge in graph G′, the edge (a, b) can be removed from the register edge set of graph G′.

(!a;c ? !a;b)  II + MinDist(b; c)  0 we can guarantee that if the edge (a; b) is a scheduling edge in graph G0 , the edge (a; c) can be removed from the scheduling edge set of graph G0 . Similarly, we can guarantee that if the edge (a; c) is a register edge in graph G0 , the edge (a; b) can be removed from the register edge set of graph G0 . d) Dependence graph with replicated operations

Figure 3.7: Removing edges. (a) Dependence graph: the value of vr0, produced by $A$, is used by $B$ ($\omega = i$) and by $C$ ($\omega = j$); (b) graph $G$: $E_{sched} = \{(a,b), (a,c)\}$, $E_{reg} = \{(a,b), (a,c)\}$; (c) graph $G'$ (if test succeeds): $E'_{sched} = \{(a,b)\}$, $E'_{reg} = \{(a,c)\}$; (d) dependence graph with replicated operations $A^0$, $B^0$, $C^0$, $B^i$ ($i \cdot II$ after $B^0$), and $C^j$ ($j \cdot II$ after $C^0$, at least $\mathit{MinDist}(B, C)$ after $B^0$).

Figure 3.7a illustrates Theorem 3.2. In this partial dependence graph, the value produced by operation $A$ is used $i$ iterations later by operation $B$ ($\omega_{a,b} = i$) and $j$ iterations later by operation $C$ ($\omega_{a,c} = j$). The sets of edges associated with this dependence graph are presented in Figure 3.7b. When the inequality of Theorem 3.2 holds, the sets of edges shown in Figure 3.7c are guaranteed to result in defining the same scheduling search space with the same register requirements.

Proof. Consider the dependence graph with replicated operations shown in Figure 3.7d, which corresponds to the same dependence graph as illustrated in Figure 3.7a. In this graph, operations $A^0$, $B^0$, and $C^0$ are operations of the first iteration, operation $B^i$ is an operation of iteration $i$, and operation $C^j$ is an operation of iteration $j$. As defined in Figure 3.7a, the result of operation $A^0$ is used by operations $B^i$ and $C^j$. Several scheduling distances can be inferred from Figure 3.7d. First, the modulo constraint guarantees that operation $B^i$ is scheduled exactly $i \cdot II$ cycles after $B^0$. Similarly, operation $C^j$ is guaranteed to be scheduled $j \cdot II$ cycles after $C^0$. We can also use the minimum distance relation to assert that operation $C^0$ must be scheduled at least $\mathit{MinDist}(b, c)$ cycles after operation $B^0$. We now demonstrate the conditions that must hold to guarantee that operation $C^j$ is scheduled no earlier than operation $B^i$. Without loss of generality, we state that operation $B^i$ is scheduled at time $t_{B^0} + i \cdot II$. Summing the scheduling distances from operation $B^0$ to $C^j$, we can guarantee that operation $C^j$ is scheduled no earlier than $t_{B^0} + \mathit{MinDist}(b, c) + j \cdot II$. Using these results, we conclude that operation $C^j$ is scheduled no earlier than operation $B^i$ if the difference of their schedule times is nonnegative, namely if $(t_{B^0} + \mathit{MinDist}(b, c) + j \cdot II) - (t_{B^0} + i \cdot II) \geq 0$. Simplifying the $t_{B^0}$ terms and substituting $\omega_{a,b}$ and $\omega_{a,c}$ for $i$ and $j$, respectively, we obtain the inequality of Theorem 3.2. Note that this inequality is independent of $t_{B^0}$, the time at which operation $B^0$ is scheduled. When operation $C^j$ is guaranteed by the edge-removal test to be scheduled no earlier than operation $B^i$, we know that the scheduling edge $(a, c)$ is guaranteed to be fulfilled and can therefore be removed. Similarly, we know that the register requirements of edge $(a, b)$ are satisfied by register edge $(a, c)$. $\Box$


Algorithm 3.1 (Edge Removal) For a given dependence graph $G = \{V, E_{sched}, E_{reg}\}$ and initiation interval $II$, we construct an equivalent graph $G' = \{V, E'_{sched}, E'_{reg}\}$ that is guaranteed to define the same scheduling search space with the same register requirements and fewer edges as follows:

1. Initialize $G' = G$ and compute the MinDist matrix for the given $II$.

2. Apply Theorem 3.2 to each pair of outgoing edges of a vertex $v$, removing redundant edges in both $E'_{sched}$ and $E'_{reg}$ when feasible.

3. Repeat Step 2 for each vertex of the graph.

The worst-case complexity of Algorithm 3.1 is quadratic in the number of edges, as the algorithm compares pairs of outgoing edges for each operation in the dependence graph.
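As an illustration of Algorithm 3.1 (a sketch under our own data-layout assumptions, reusing the min_dist routine sketched above), the pairwise test of Theorem 3.2 can be applied to the outgoing edges of each vertex as follows:

```python
def remove_redundant_edges(ops, sched, reg, II):
    """Algorithm 3.1 sketch: prune edges justified by Theorem 3.2.

    ops: operations numbered 0..n-1.
    sched, reg: dicts mapping edge (a, b) -> (latency, omega).
    Returns pruned copies (E'_sched, E'_reg).
    """
    merged = dict(reg)
    merged.update(sched)  # one latency/omega per edge for MinDist
    edges = [(a, b, lat, w) for (a, b), (lat, w) in merged.items()]
    dist = min_dist(len(ops), edges, II)

    sched2, reg2 = dict(sched), dict(reg)
    for a in ops:
        succs = [(b, w) for (x, b), (_, w) in merged.items() if x == a]
        for b, w_ab in succs:
            for c, w_ac in succs:
                if b == c:
                    continue
                # Theorem 3.2: c can never be scheduled earlier than b.
                if (w_ac - w_ab) * II + dist[b][c] >= 0:
                    if (a, b) in sched2:   # scheduling edge (a,b) enforces (a,c)
                        sched2.pop((a, c), None)
                    if (a, c) in reg2:     # lifetime of (a,b) covered by (a,c)
                        reg2.pop((a, b), None)
    return sched2, reg2
```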

3.6 Related Work

Scheduling techniques that exploit the instruction level parallelism available in loop code have been extensively researched. In this section, we present some of the approaches that have recently been published in this area. A more complete and historical description of this area can be found in [79]. Specific work on optimal schedulers for loop code based on linear programming and integer programming models is omitted from this section and is presented in detail in Chapters 4 and 5.

This general area of research may be divided into two broad categories. One category includes the acyclic scheduling techniques, which generally rely on unrolling the body of the loop some number of times and applying global scheduling techniques to the unrolled loop body. The advantage of this approach is that robust and efficient scheduling algorithms can be used to schedule a wide range of loops, including loops with conditional statements, early exits, and procedure calls [42][50][59][61]. A drawback of this approach, however, is that the instruction level parallelism is extracted only within one instance of the unrolled loop body. The resulting performance degradation can be alleviated by increasing the degree of unrolling, at the cost of increased code size.

The other category includes the software pipelining techniques, which exploit the instruction level parallelism present in a loop by overlapping the execution of consecutive loop iterations. The advantage of this approach is that the increased flexibility in the scheduling domain may result in loop schedules with the highest achievable performance. This approach was named by Charlesworth [22] for its analogy to hardware pipelines and was formulated by Rau [80] for simple loop iterations without conditional statements. Since then, a large body of research has sought efficient software pipelining algorithms that achieve high performance code for a wide range of loops, e.g. loops with conditionals, while loops, and nested loops.

One approach to software pipelining is modulo scheduling, which was pioneered by Rau, Glaeser, and Picard [80][81]. In these first papers, the constraints imposed by the modulo scheduling technique were defined, and the lower bounds on the initiation interval were proposed. This work was later extended by Rau and others for machines with complex resource requirements and for loop iterations with conditional statements [26][27][49][77][78]. They developed an iterative scheduling algorithm that could schedule and unschedule operations to produce high-performance schedules even in the presence of complex resource requirements. Conditionals in the loop iterations were handled by using IF-conversion [4][72] to replace each operation in conditional statements with a predicated operation [48], i.e. an operation that completes normally when a logical expression, referred to as a predicate, evaluates to true; otherwise, the predicated operation is transformed into a no-op and has no side effect.

Handling conditionals using predicated operations requires special hardware and may result in suboptimal schedules, as the operations of both the then and the else statements contribute to the resource requirements and the critical dependence cycles of the loop iterations. To address this performance issue, Lam proposed a hierarchical reduction technique [55] that schedules the operations of then and else statements separately, and then treats the operations of both partial schedules as a unique pseudo operation that consumes the union of the resource requirements of both partial schedules. After completion of the modulo scheduling and hierarchical reduction processes, the resulting schedule is expanded to replace the pseudo operations by the operations of the then and else statements. During this process, a schedule with $m$ overlapped if-then-else statements results in $2^m$ control flow paths connected by the appropriate branching and merging between the paths. This approach does not require special hardware, such as support for predicated operations, but may result in significant code expansion due to the combinatorial expansion of the overlapped conditional statements.

Another alternative technique was proposed by Warter et al to efficiently handle conditional statements without hardware support [94][95]. This technique, named enhanced modulo scheduling, schedules the operations of a loop as if the machine supported predicated operations. It uses IF-conversion to remove the conditional statements in the loop iteration and then searches for a modulo schedule. In this approach, two operations may share a unique resource provided that the operations are guarded by predicates with disjoint values, i.e. the operations cannot both execute in a legal execution path of the original flow graph. After the completion of the modulo scheduling process, predicated operations are expanded into an equivalent control flow graph using a REVERSE-IF-conversion mechanism. The advantage of enhanced modulo scheduling over the hierarchical reduction approach is that operations of conditional statements can be scheduled more freely: since operations are not part of some hierarchical structure, they may be scheduled and unscheduled individually regardless of the original control flow graph. Experimental evidence gathered by Warter [94] indicates that the enhanced modulo scheduling technique achieves lower initiation intervals.

The modulo scheduling approach was also extended for loops with conditional exits, such as while loops [87][89]. Furthermore, several optimizations were proposed to successfully increase the available instruction level parallelism for modulo schedules, such as elimination of redundant loads and stores across loop iterations [27] as well as transformations to reduce the critical paths caused by data dependence [85] or control dependence [87]. Recently, optimizations based on unrolling a loop body before applying a modulo scheduling algorithm have also been investigated [56]. Techniques based on Petri nets have also been proposed to determine a suitable degree of unrolling for modulo schedules [5].

Because high-performance schedules result in increased register requirements, research was also conducted on developing efficient register-sensitive modulo scheduling algorithms. Huff has investigated a heuristic based on a bidirectional slack-scheduling method that schedules operations early or late depending on their number of stretchable input and output flow dependences [49]. Llosa et al have recently proposed two heuristics that are based on a bidirectional slack-scheduling method with a scheduling priority function tailored to minimize the register requirements [57][58]. A detailed comparison of these approaches with our approach for register-sensitive modulo scheduling is provided in Chapter 4.

Register allocation algorithms for modulo schedules have been investigated by Rau et al [82]. Their algorithm achieves register allocations that are within one register of the MaxLive lower bound for the vast majority of their modulo-scheduled loops on machines with rotating register files and predicated execution to support modulo scheduling. By using loop unrolling, with its consequent code expansion, their algorithm typically achieves register allocations that are within four registers of the MaxLive lower bound on machines without such hardware support. Hendren et al [46] have also investigated register allocation algorithms tailored for modulo scheduling and based on cyclic interval graphs. They have also proposed spilling heuristics. Eisenbeis et al [39] have also contributed a register allocation technique based on cyclic graphs.

All the modulo scheduling techniques presented above have relied on the original modulo constraints that enforce a unique initiation interval among consecutive loop iterations. However, this assumption may result in suboptimal schedules in the presence of unbalanced conditional statements in a loop body, since loop iterations are initiated every II cycles regardless of whether a short or long conditional statement is executed. This performance issue was recently addressed by Warter-Perez and Partamian [96], where the modulo scheduling approach was extended to multiple initiation intervals for machines supporting predicated operations and compound predicates. They report performance improvements of up to 25%, 5% on average, for a small benchmark of loops from the Perfect benchmark suite.

Several alternative approaches to modulo scheduling have been proposed to address the above limitation, i.e. suboptimal performance due to a unique initiation interval per loop schedule. One approach, proposed by Aiken et al [3], constructs a software pipeline schedule by scheduling operations of several consecutive iterations and testing for repeating states until a steady-state pattern is found along all paths. This approach handles loops with arbitrary control flow graphs, machines with finite resources, and operations with simple resource patterns. Its main advantage is that it does not restrict the space of software pipelining as much as modulo scheduling does and thus may result in better schedules, particularly for loops with unbalanced conditional statements. However, the lack of constraints results in an algorithm that is more difficult to engineer. For example, the algorithm critically relies on the number of loop iterations considered at any instant of time during the scheduling process: too low a number may result in a schedule with little overlap among consecutive iterations, and too large a number may result in an excessively large code size. Also, this algorithm relies on recognizing repeating states, which may result in large memory and computational overhead.

Another approach to software pipelining, developed by Ebcioglu et al of the IBM VLIW project and named enhanced pipeline scheduling, also investigates schedules with multiple initiation intervals [52][67]. It is based on a global acyclic scheduling algorithm that is repetitively applied to a loop iteration. Since, in general, the dependence graph of a loop is cyclic, the algorithm introduces a fence (i.e. a cut edge set) to yield an acyclic dependence graph. By moving the fence at each scheduling step, the algorithm succeeds in overlapping the execution of consecutive operations. Advantages of this approach are that it works for general control flow graphs and results at each step in a valid schedule. Thus conventional optimization techniques can be used at each step to improve the performance of the resulting code.

A technique based on repeatedly transforming a schedule into a more compact one using a mechanism analogous to retiming has also been investigated, by Chao et al [21] for example, and was found to result in schedules with good performance, i.e. a small initiation interval and a short schedule length. These results were obtained for arbitrary dependence graphs on machines with finite resources and simple resource reservation tables.


CHAPTER 4

Stage Scheduling Algorithms

As described in Chapter 3, we treat modulo scheduling as a three-step procedure: (1) MRT-scheduling, which produces a Modulo Reservation Table (MRT); (2) stage-scheduling, which assigns each operation in a given MRT to a stage; and (3) register allocation, which actually allocates virtual registers to physical registers. In this chapter, we investigate an optimal stage-scheduling algorithm for Step 2 that minimizes the register requirements, as well as efficient stage-scheduling heuristics that reduce the register requirements. Each algorithm presented here assumes that a valid MRT is given as input and produces a valid stage schedule with the minimum, or heuristically reduced, register requirements among all valid stage schedules for the given MRT.

We first present an algorithm that performs Step 2 in an optimal fashion, minimizing MaxLive, the minimum number of registers required to generate spill-free code [49]. This optimal algorithm, referred to as the MinReg Stage-Scheduler, proceeds in linear time (in the number of edges) for loops whose dependence graphs have acyclic underlying graphs.¹ Otherwise, the optimal algorithm uses a general linear-programming approach to handle general dependence graphs with unrestricted loop-carried dependences and common subexpressions. We can also quickly determine when the former (linear-time) method is applicable.

¹The underlying graph is an undirected graph formed by ignoring the direction of each arc.


We then present a set of stage scheduling heuristics that reduce the register requirements of a modulo schedule by reassigning some operations to different stages, i.e. by shifting operations by multiples of II cycles. These heuristics result in an optimal stage schedule for loops with acyclic underlying dependence graphs and, otherwise, achieve at a significantly lower computational cost a significant part of the decrease in register requirements obtained by the MinReg Stage-Scheduler. The computational complexity of the heuristics investigated ranges from linear to quadratic in the number of edges in the dependence graph. These heuristics address a common shortcoming of current register-insensitive modulo schedulers, which tend to produce schedules with increased register requirements for operations not on the critical path.

We may also constrain the scheduling space of stage scheduling to guarantee that the overall performance of an input modulo schedule is maintained, or increased, while searching for a stage schedule with lower register requirements. This technique is applicable to both the optimal algorithm and the heuristics for stage scheduling and allows us to obtain a register-sensitive modulo schedule without incurring any performance penalty.

Both the optimal algorithm and the heuristics for stage scheduling presented in this chapter handle loop iterations that consist of a single predicated basic block with general dependence graphs. These algorithms find schedules that satisfy the dependence constraints on machines with finite resources. When computing the register requirements of a schedule containing predicated operations, we assume that each predicate value is independent of the others, which is a worst-case assumption as far as register requirements are concerned. Further reducing the register requirements of schedules by investigating the relations among predicate values is postponed to Chapter 6.

We investigate the performance of the optimal stage scheduler, the stage-scheduling heuristics, and other schedulers for a benchmark suite of 1327 loops from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels for a machine with complex resource usage, the Cydra 5 [11]. Our empirical findings show that the register requirements decrease by 19.8% on average when applying the optimal stage scheduler to the MRT-schedules of the register-insensitive modulo scheduler employed in our experiments. Our experimental findings also show that our best heuristic achieves on average 99.0% of the decrease in register requirements obtained by the optimal stage scheduler over the entire benchmark suite. Similarly, our best linear-time heuristic achieves on average 98.8% of the optimum decrease in register requirements.

In this chapter, related work is presented in Section 4.1, the concepts of stage scheduling are illustrated in Section 4.2, and our optimal stage-scheduling algorithm is developed in Section 4.3. Heuristics for stage scheduling are presented in Section 4.4. An approach to schedule-length-sensitive stage scheduling is introduced in Section 4.5. Measurements are presented in Section 4.6 and conclusions in Section 4.7.

4.1 Related Work

In this section, we present related work in which operations of a modulo schedule are displaced by multiples of II cycles, which we refer to as stage scheduling. We also contrast our stage scheduling heuristics with published work on lifetime-sensitive modulo scheduling heuristics.

The concept of displacing an operation or a group of operations by multiples of II cycles in a modulo schedule was first proposed by Hsu [48]. In his work, loops with general dependence graphs are processed in two steps. First, the strongly connected components of the dependence graph are identified and their operations are scheduled with the smallest feasible II using a combinatorial search procedure. Then, the operations in each strongly connected component, as well as operations not in any strongly connected component, are displaced by multiples of II cycles until a modulo schedule that satisfies each dependence constraint is found. Stage scheduling was thus employed to find a modulo schedule with the smallest feasible II based on the partial schedule of operations in strongly connected components. Displacing operations by II cycles to obtain a schedule with a shorter schedule length was also used early on by Lee and collaborators from Hewlett Packard Laboratories.

A linear-time stage-scheduling algorithm was later proposed by Mangione-Smith et al to minimize the register requirements of loop iterations where each virtual register is used at most once, i.e. for forest-of-trees dependence graphs [63][64]. They also proposed the first precise model of the register requirements for stage schedules with dependence graphs that are forests of trees. Our optimal stage-scheduler work extends their work in two directions. First, the linear-time method of our MinReg Stage-Scheduler permits multiple uses of each virtual register, provided that the underlying dependence graph is acyclic. Second, the general method of our MinReg Stage-Scheduler handles arbitrary dependence graphs, including loop-carried dependences and common subexpressions.

Other modulo scheduling approaches have also investigated a two-step approach in which one of the two steps is often similar to stage scheduling. For example, the work of Eisenbeis and Windheiser has investigated an algorithm for determining the optimal unrolling degree necessary to find a modulo schedule that fully saturates a critical resource of the machine [40]. During the first step, the optimal unrolling degree is determined and an MRT-schedule is found; during the second step, a stage schedule is found that minimizes the schedule length for the given MRT-schedule. This result was obtained for directed acyclic dependence graphs on machines with simple reservation tables, i.e. each type of operation reserves one resource for a specifiable duration. The work of Wang et al [92] has also investigated a two-step heuristic for modulo scheduling. Following the publication of our stage scheduler for minimum register requirements [35], Wang et al presented a stage scheduler that minimizes the buffer requirements [93]. As presented in Section 3.2, buffers only approximate registers in that a buffer must be reserved for a time interval that is an integer multiple of II cycles, whereas registers may be reserved for an arbitrary number of cycles. The results presented in this chapter are more accurate in that we precisely minimize the register requirements of a modulo schedule.

Some researchers have investigated optimal solutions for modulo scheduling, solving the MRT-scheduling and stage-scheduling steps jointly, under various simplifying restrictions. For example, the optimal modulo scheduling algorithm presented in Chapter 5 finds a schedule with the highest steady-state throughput over all modulo schedules, and the minimum register requirements among such schedules. By considering all MRTs for a given II, that algorithm generally results in schedules with lower register requirements than the optimal stage scheduling algorithm proposed in this chapter for a given MRT; however, the algorithm of Chapter 5 is extremely computationally intensive. As a result, only medium-sized loops can be handled in a reasonable time using that approach. For example, we have used it for loop iterations with up to 41 operations and II up to 118 when an optimal schedule is sought using no more than 15 minutes of computation, whereas for the algorithms of this chapter our benchmark suite contains loop iterations with up to 161 operations and II up to 165.

Heuristics for register-sensitive modulo scheduling have been proposed by Huff [49] and Llosa et al [57][58]. These approaches are flexible in that they directly generate a modulo schedule, thus potentially generating schedules with lower register requirements. However, several potentially conflicting constraints and objectives must be considered at once, such as satisfying the resource constraints, scheduling operations along critical dependence cycles to maximize the steady-state throughput of the schedule, and minimizing the total schedule length of one loop iteration. A potential drawback of these combined approaches is that the resulting schedules may compromise the overall performance of the schedule in order to achieve lower register requirements, especially for machines with complex resource requirements.

This contrasts with our register-sensitive stage-scheduling heuristics, which can solely focus on decreasing the register requirements of a schedule without impact on its performance, thus clearly separating primary and secondary objectives into distinct phases. For example, we may first use a modulo scheduler, such as the Iterative Modulo Scheduler proposed by Rau [77][78], that is highly optimized to generate high performance code (i.e. code with a small initiation interval and a short schedule length) for machines with complex resource requirements. We may then employ our optimal stage-scheduling algorithm or some of the stage-scheduling heuristics to reduce the register requirements. None of these algorithms modify the initiation interval, and thus they have no impact on the steady-state performance of the resulting schedule. If the schedule-length sensitive heuristics are used, we can also guarantee that the schedule length will not increase, thus maintaining or increasing the transient performance of the resulting schedule. Thus, a high-performance schedule with low register requirements can be achieved without impact on performance by using a two-phase approach, with stage scheduling in the second phase.

A potential drawback of the stage scheduling approach is that it is given an MRT, and the given MRT may not be one that can result in a schedule with the lowest register requirements, i.e. some other MRT with the same II might permit a stage schedule with lower register requirements. Experimental evidence using the optimal modulo scheduling algorithm presented in Chapter 5, which minimizes the register requirements over all modulo schedules with a specified value of II, indicates, however, that the increase in register requirements due to the two-phase approach is relatively small. This may be explained by the fact that the modulo scheduling heuristic used in our experiment to produce the MRT-schedules minimizes the total schedule length of one loop iteration, and thus naturally tends to produce MRTs with low register requirements. Another concern is that the stage scheduling approach may be less efficient for loops with large II, as it can only move operations by integer multiples of II cycles; measurements presented in Chapter 5 provide an empirical bound on the degradation due to this effect.

4.2 Stage Scheduling and Register Requirements

In this section, we illustrate the impact of stage scheduling on the register requirements with a few examples for the target processor summarized in Table 3.1, with three general-purpose functional units, and the register model described in Section 3.2, with an additional reserved time of one cycle.


Example 4.1 This simple example, identical to Example 3.1, illustrates the impact of stage scheduling on the register requirements of a modulo schedule. The loop kernel is: z[i] = 2*x[i] + y[i], which is computed as shown in the dependence graph of Figure 3.1a. Figure 3.1b presents an input MRT-schedule for this simple kernel. This MRT is valid since no more than 3 functional units are used in any cycle.

The stage scheduler delays each operation by some integer multiple of II cycles so that it is scheduled only after its input values have been calculated and made available to it by its predecessor operations. By delaying operations only by integer multiples of II, the MRT row in which each operation is placed is unaltered; thus, the resource constraints are guaranteed to be fulfilled for any stage schedule associated with a valid MRT-schedule. A stage schedule for Example 4.1 is shown in Figure 4.1c. The circles highlight the operations that belong to the iteration initiated at time 0. The circles must be placed so as to satisfy the specified operation latencies. The lifetimes for this iteration are shown in Figure 4.1d (note that the legend shown at the right of Figure 4.2 applies to Figure 4.1 as well). The dependence constraints are satisfied if and only if no circled operation is placed during the portion of the lifetime that overlaps with the latency portion of the lifetime (black bar) of a virtual register that it uses. For example, the add operation uses vr1 and thus cannot be scheduled at time 1 or 3, but can be scheduled at time 5. Since the add operation is placed in Row 1 of the MRT, only those rows in the replicated MRT that are congruent to 1 (mod II) are considered. The add will thus be delayed by an integer multiple of II, $2 \cdot II$ in this example.

Traditionally, a schedule is represented by associating each operation with a schedule time; in this chapter, however, a schedule is represented by associating each dependence edge with a schedule time interval, referred to as the skip factor. This novel representation is critical for expressing the stage scheduling problem efficiently. Consider a dependence edge from operation $i$ to operation $j$, scheduled at $time_i$ and $time_j$, respectively. We define the skip factor along edge $(i, j)$ as the number of times that the row of operation $j$ is encountered in the execution trace during the time interval $[time_i, time_j)$.

Figure 4.1: MRT and stage schedules with load-y scheduled early (Example 4.1). (a) Dependence graph; (b) MRT; (c) schedule (II = 2); (d) lifetimes of vr0 through vr3; (e) register requirements.

Figure 4.2: MRT and stage schedules with load-y scheduled late (Example 4.1). (a) Schedule (II = 2); (b) MRT; (c) lifetimes; (d) register requirements. (Legend: latency portion of the lifetime; additional lifetime; schedule of one iteration; integral part; fractional part.)

In Figure 4.1c, for example, the skip factor along the dependence edge (mult, add) is 2, as the row of the add operation is encountered twice in the interval [1, 5), once at time 1 and once at time 3. The skip factor is within 1 of the stage difference used in [64][70], and results in a simpler, but equivalent, formulation.
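In code form (our own hypothetical sketch), the skip factor of an edge can be counted directly from the two schedule times and the MRT row of the consumer:

```python
def skip_factor(time_i, time_j, row_j, II):
    """Number of times row_j is encountered in the interval [time_i, time_j)."""
    return sum(1 for t in range(time_i, time_j) if t % II == row_j)
```

For the (mult, add) edge of Figure 4.1c, skip_factor(1, 5, 1, 2) returns 2, matching the count above.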

Formal Problem Definition (Optimal Stage Scheduling). Among all the stage schedules that employ a given MRT, find one (i.e. assign the skip factors) that satisfies all the dependence constraints and minimizes MaxLive, the maximum number of live values at any single cycle of the loop schedule.

Stage scheduling is performed by assigning an integer value to each skip factor, i.e. to each dependence edge, so as to generate a valid stage schedule that minimizes MaxLive. The steady-state register requirements of the schedule presented in Figure 4.1c are shown in Figure 4.1e. Counting the integral and fractional parts in the row with the most fractional parts, we obtain a MaxLive of eight. The stage scheduling approach can, however, reduce MaxLive from eight to six by assigning the load-y operation to stage 2 (at time 4) instead of stage 0 (at time 0). This improved schedule is shown in Figure 4.2. The stage schedule presented in Figure 4.2 results in the minimum register requirements for this kernel, MRT, and set of functional unit latencies.

Mangione-Smith et al [64] have shown that for dependence graphs with at most a single use per virtual register, as in Example 4.1, minimizing each skip factor individually always results in the minimum register requirements. However, this result does not apply to general dependence graphs with unrestricted common subexpressions and loop-carried dependences. The next example illustrates such a dependence graph in which minimizing each skip factor individually does not result in the minimum register requirements.

Example 4.2 Our second kernel is: y[i] = x[i]^2 - x[i] - a, where the value of x[i] is read from memory, squared, decremented by x[i] + a, and stored in y[i], as shown in the dependence graph of Figure 4.3a. Figure 4.3b illustrates an MRT-schedule for the kernel of Example 4.2 and Figure 4.3c presents a stage schedule in which each operation is scheduled in its earliest stage.

Figure 4.3: MRT and stage schedules with add scheduled early (Example 4.2). (a) Dependence graph; (b) MRT; (c) schedule (II = 2); (d) lifetimes; (e) register requirements.

Figure 4.4: MRT and stage schedules with add scheduled late (Example 4.2). (a) Schedule (II = 2); (b) MRT; (c) lifetimes; (d) register requirements.

Although the MRT-schedule shown in Figure 4.3b is constructed to illustrate our point concisely, experimental evidence indicates that similar situations do occur in our benchmark suite. Notice the two distinct paths between operations load and sub in the dependence graph. Because we represent a stage schedule by associating time intervals with dependence edges, we must ensure that the sum of the time intervals along the path load-mult-sub is equal to the sum along the path load-add-sub. In general, there are distinct paths among operations if there is a cycle in the underlying dependence graph, i.e. if there is a cycle in the dependence graph when the directions of the arcs are ignored. For each cycle in the underlying graph, we must ensure that the cumulative time interval along a path that traverses the cycle is zero. The cumulative distance of a path corresponds to the algebraic sum of the time intervals associated with the edges along that path, where a time interval is taken as negative when the edge is traversed in the reverse direction. Subsequently, we refer to the elementary cycles in an underlying dependence graph as underlying-cycles, to the cumulative distance along a counter-clockwise closed path of an underlying cycle as the cumulative distance of an underlying cycle, and to the new constraints associated with these cycles as underlying-cycle constraints. We present in detail how to derive the underlying-cycle constraints in Section 4.3.1.

In Figure 4.3, the add operation is scheduled in stage 0 (at time 1). However, we can also schedule the add later (in stage 2 or 3, namely at time 3 or 5) without delaying the sub operation, since the result of this operation must be kept alive through time 6 in any case. In Figure 4.4, the add operation is scheduled at time 5. By comparing Figures 4.3e and 4.4d, we see that scheduling the add operation later increases the lifetime of vr0 by 1½ columns, but decreases the lifetime of vr2 by 2 columns. As a result, MaxLive is reduced from 9 to 8 by scheduling the add operation late. This result is surprising, since we might expect the lifetime increase of vr0 to match the 2-column lifetime decrease of vr2. However, with an early add, the mult is the last use of vr0 and the first cycle of additional add delay (during time 1) does not increase the vr0 lifetime.

Example 4.2 shows that scheduling each operation as early as possible does not necessarily result in minimum register requirements for dependence graphs that have cycles in their underlying graphs, and that a more global stage scheduling algorithm is needed when underlying cycles are present.

To summarize the findings of this section, we have seen that stage scheduling can impact the register requirements for a given kernel, MRT, and set of functional unit latencies. A stage scheduler must enforce the dependence constraints, scheduling each operation only after all its input operands are available. In the presence of cycles in the underlying dependence graph, the stage scheduler must also ensure that the cumulative distance of each underlying-cycle is zero. Since underlying-cycles can interact with one another, minimizing the register requirements can only be achieved for general dependence graphs by reconciling these interactions globally.

4.3 Optimal Algorithm for Minimum MaxLive

In Section 4.3.1, we first present an algorithm that minimizes the integral part of the register requirements. In Section 4.3.2, we extend this algorithm to minimize total MaxLive. Inclusion properties that decrease the number of problems investigated when minimizing total MaxLive are presented in Section 4.3.3.

4.3.1 Scheduling for Minimum Integral MaxLive

We develop in this section an algorithm that finds a stage schedule resulting in the minimum integral MaxLive for a given kernel, MRT, and set of functional unit latencies. We introduce the variables that characterize a stage schedule and present the conditions that define a valid schedule. All variables presented in this section are summarized in Table 4.1.

We omit here the fractional MaxLive because its behavior is highly nonlinear. This omission results in stage schedules that require no more than one additional register per virtual register that is used multiple times, including at least one use by an operation in an underlying-cycle. However, our final algorithm in Section 4.3.2 will take both the integral and the fractional MaxLive into account.

Machine (input):
  $l_{i,j}$ : latency between operation $i$ and operation $j$.

Graph (input):
  $G = \{V, E_{sched}, E_{reg}\}$ : dependence graph with scheduling edge set and register data flow edge set.
  $U_G$ : set of underlying cycles in the underlying graph of the dependence graph $G\{V, E_{sched} \cup E_{reg}\}$.
  $\omega_{i,j}$ : dependence distance (in iterations) between operation $i$ and operation $j$.

MRT schedule (input):
  $row_i$ : row number in which operation $i$ is scheduled.
  $II$ : initiation interval.

Results (output):
  $s_{i,j}$ : minimum (integer) skip factor necessary to hide the latency $l_{i,j}$ of the scheduling edge $(i, j)$.
  $p_{i,j}$ : additional (nonnegative integer) skip factor used to postpone operation $j$ further.

Table 4.1: Variables for the stage scheduling model.

We represent a loop iteration, its dependence graph, and the machine latencies using the same notation as in Section 3.5. Recall that the dependence graph is represented by $G = \{V, E_{sched}, E_{reg}\}$, where $V$ is the set of operations, $E_{sched}$ corresponds to the set of scheduling dependences, and $E_{reg}$ corresponds to the set of register dependences among the operations. Furthermore, an edge from operation $i$ to operation $j$, $w$ iterations later, is associated with a latency $l_{i,j}$ and a dependence distance $\omega_{i,j} = w$. We characterize the initial MRT-schedule by its initiation interval $II$ and by the row of the MRT in which each operation is placed. We refer to the row in which operation $i$ is placed as $row_i$. We also define the following distance relation between operations $i$ and $j$:

$rdist_{i,j}$ : distance from the row of operation $i$ to the next row of operation $j$, possibly in the next instance of the MRT,

which may be computed as follows:

$$rdist_{i,j} = dist(row_i, row_j) = (row_j - row_i) \bmod II \qquad (4.1)$$

Note that since the MRT is given and unchanged by the stage scheduler, $row_i$ and $rdist_{i,j}$ are simply constants derived from the input MRT. Using these terms, and referring to the time at which operation $i$ is scheduled as $time_i$, we may define the time interval along an edge from operation $i$ to operation $j$ as:

$$time_j - time_i = rdist_{i,j} + skip_{i,j} \cdot II \qquad (4.2)$$

where the first and second terms of the right-hand side represent, respectively, the fractional part and the integral part (integer multiple of $II$) of the schedule distance along edge $(i, j)$. In this chapter, we distinguish between two components of $skip_{i,j}$: $s_{i,j}$, the minimum (integer) skip factor necessary to hide the latency of the scheduling edge, and $p_{i,j}$, an additional (nonnegative integer) skip factor used to postpone operation $j$ further. The $s_{i,j}$ are input constants, evaluated as in Equation (4.4) below, whereas the $p_{i,j}$ are the fundamental variables determined by the stage scheduler.

Consider a scheduling edge between operation $i$ and a dependent operation $j$, $\omega_{i,j}$ iterations later. Using Equation (4.2), we may write the dependence constraint, i.e. $(time_j + \omega_{i,j} \cdot II) - time_i \geq l_{i,j}$, as:

$$rdist_{i,j} + (\omega_{i,j} + s_{i,j} + p_{i,j}) \cdot II \geq l_{i,j} \qquad (4.3)$$

Equation (4.3) states that the distance from the row of operation $i$ to the next row of operation $j$, increased by skipping $\omega_{i,j} + s_{i,j} + p_{i,j}$ entire instances of the MRT, must be no smaller than the latency $l_{i,j}$ of the scheduling edge $(i, j)$. Since we are interested in the smallest integer value of $s_{i,j}$ that satisfies Equation (4.3), regardless of the nonnegative value of $p_{i,j}$, we obtain the following value for the minimum skip factor $s_{i,j}$:

$$s_{i,j} = \left\lceil \frac{l_{i,j} - rdist_{i,j}}{II} \right\rceil - \omega_{i,j} \qquad (4.4)$$
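As a concrete reading of Equations (4.1) and (4.4), the following sketch (an illustration with our own names, not code from the dissertation) computes the row distance and minimum skip factor of an edge:

```python
import math

def rdist(row_i, row_j, II):
    # Equation (4.1): distance to the next row of operation j.
    return (row_j - row_i) % II

def min_skip(latency, row_i, row_j, omega, II):
    # Equation (4.4): smallest integer s satisfying the scheduling
    # constraint (4.3) for any nonnegative p.
    return math.ceil((latency - rdist(row_i, row_j, II)) / II) - omega
```

For instance, with $l_{i,j} = 1$, $rdist_{i,j} = 0$, $\omega_{i,j} = 3$, and $II = 2$, min_skip returns $\lceil 1/2 \rceil - 3 = -2$, illustrating the negative skip factors discussed next.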

Consequently, a scheduling edge $(i, j)$ is satisfied when operation $j$ is skipped at least $s_{i,j}$ times after operation $i$ before being scheduled. If $\omega_{i,j}$ is sufficiently large, $s_{i,j}$ may be negative.

Finally, we need to guarantee that all underlying-cycle constraints are also fulfilled. Consider a directed closed path in the dependence graph that traverses some underlying-cycle $u$. Define $sign_{i,j}(u)$ to be $+1$ if the closed path traverses edge $(i, j)$ in the forward direction, and $-1$ if the closed path traverses edge $(i, j)$ in the reverse direction. $Sign_{i,j}(u)$ is undefined for edges $(i, j)$ that do not belong to underlying-cycle $u$. Using Equation (4.2), the cumulative distance of underlying-cycle $u$ is thus:

$$\sum_{(i,j) \in u} sign_{i,j}(u) \, \{\, rdist_{i,j} + (s_{i,j} + p_{i,j}) \cdot II \,\} \qquad (4.5)$$

By setting this cumulative distance to zero, as required, we obtain the underlying-cycle constraint:

$$\sum_{(i,j) \in u} sign_{i,j}(u) \, p_{i,j} = -\sum_{(i,j) \in u} sign_{i,j}(u) \, s_{i,j} - \delta_u \qquad \forall u \in U_G \qquad (4.6)$$

where $\delta_u$ is:

$$\delta_u = \frac{1}{II} \sum_{(i,j) \in u} sign_{i,j}(u) \, rdist_{i,j} \qquad (4.7)$$

for each of the underlying cycles in the set $U_G$ of underlying cycles in the underlying graph of the dependence graph $G = \{V, E_{sched} \cup E_{reg}\}$. The first noticeable fact is that all but the $p_{i,j}$ variables are fully defined by the input parameters to the stage scheduler. We are therefore free to set the $p_{i,j}$ variables to any nonnegative integer value, as long as they satisfy Equation (4.6) for each underlying cycle. The second important fact is that Equation (4.6) has a solution only if $\delta_u$ is itself integer valued, since all the $s_{i,j}$ and $p_{i,j}$ are integer valued. This is fortunately always the case, since the cumulative distance from an operation to any instance of itself is by definition a multiple of $II$.

Figure 4.5: Computing the underlying cycle constraint (Example 4.3). (a) Dependence graph (operations ld0, m1, a2, st3; loop-carried edge with $\omega_{2,1} = 3$); (b) MRT (II = 2); (c) input: rdist, latencies, and minimum skip factors ($rdist_{0,1} = 0$, $l_{0,1} = 1$, $s_{0,1} = 1$; $rdist_{1,2} = 0$, $l_{1,2} = 4$, $s_{1,2} = 2$; $rdist_{2,1} = 0$, $l_{2,1} = 1$, $s_{2,1} = -2$; $rdist_{2,3} = 1$, $l_{2,3} = 1$, $s_{2,3} = 0$); (d) underlying-cycle constraint $A$: $p_{1,2} + p_{2,1} = 0$; (e) feasible stage schedule ($p_{0,1} = p_{1,2} = p_{2,1} = p_{2,3} = 0$).

Example 4.3 Our third kernel, illustrating loop-carried dependences, is: y[i] = x[i] * y[i-3] + a, where the value of x[i] is read from memory, multiplied by y[i-3], incremented by a, and stored in y[i], as shown in the dependence graph of Figure 4.5a. The backward (bold) edge from the add operation (a2) to the mult operation (m1) represents a loop-carried dependence, where the result of the add operation is reused by the mult operation three iterations later. Its dependence distance is thus $\omega_{2,1} = 3$.

The row distance $rdist_{i,j}$ associated with edge $(i, j)$ can be computed using Equation (4.1). Using this distance and the latency $l_{i,j}$ associated with edge $(i, j)$, the minimum skip factor $s_{i,j}$ can be computed using Equation (4.4). These input constants are shown in Figure 4.5c for the MRT-schedule shown in Figure 4.5b. Using these skip factors and computing the $\delta$ value (0, according to Equation (4.7)) for a counter-clockwise traversal of the underlying cycle $u_1$, we obtain the underlying-cycle constraint shown in Figure 4.5d. A stage schedule satisfying this constraint is shown in Figure 4.5e. Note that the underlying-cycle constraint in this example is only satisfied by $p_{1,2} = p_{2,1} = 0$, since the $p_{i,j}$ must be nonnegative integers.
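The assembly of one underlying-cycle constraint from Equations (4.6) and (4.7) can be sketched as follows (our own hypothetical helpers; a cycle is given as a list of edges tagged with traversal direction):

```python
def cycle_constraint(cycle, rdist_of, s_of, II):
    """Build one underlying-cycle constraint (Equation 4.6).

    cycle: list of ((i, j), sign) pairs; sign is +1 for forward traversal
           of edge (i, j), -1 for reverse traversal.
    rdist_of, s_of: dicts mapping edge (i, j) to its rdist and s values.
    Returns (coeffs, rhs) for:  sum(sign * p) = -sum(sign * s) - delta_u.
    """
    delta_num = sum(sign * rdist_of[e] for e, sign in cycle)
    assert delta_num % II == 0, "delta_u must be integer valued"
    delta_u = delta_num // II                       # Equation (4.7)
    rhs = -sum(sign * s_of[e] for e, sign in cycle) - delta_u
    coeffs = {e: sign for e, sign in cycle}         # sign of each p_{i,j}
    return coeffs, rhs
```

For the cycle of Example 4.3 (edges (1,2) and (2,1), both traversed forward), the right-hand side is $-(s_{1,2} + s_{2,1}) - \delta_u = -(2 - 2) - 0 = 0$, reproducing the constraint $p_{1,2} + p_{2,1} = 0$ of Figure 4.5d.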

Figure 4.6: Skip and delta factors for stage scheduling (Example 4.4). (a) Dependence graph with minimum skip factors $[s_{i,j}]$, where $\delta_{u_1} = 1$ and $\delta_{u_2} = 0$; (b) replicated MRT (II = 3) over operations ld0, a1, m2, m3, a4, a5, d6, st7; (c) dependence graph with additional skip factors $(p_{i,j})$.

Example 4.4 Figures 4.6a and 4.6b, respectively, show the dependence graph and a stage schedule with minimum integral MaxLive for the kernel y[i] = (x[i]^4 + x[i] + a) / (x[i] + b), which will be used to illustrate our algorithm below. By postponing the schedule of the add1 and add5 operations from times 3 and 2 to times 9 and 8, respectively, the register requirements decrease from 15 to 13. As we will see in the next section, the schedule presented in Figure 4.6b also achieves minimum fractional MaxLive, thus resulting in the minimum register requirements for this kernel, MRT-schedule, and set of functional unit latencies.

We can now use Equation (4.4) to compute all the $s_{i,j}$ values. The resulting skip factors are shown in Figure 4.6a. Using these skip factors and computing the $\delta$ values according to Equation (4.7), as shown in Figure 4.6a for a counter-clockwise traversal of underlying cycles $u_1$ and $u_2$, the constraints for this kernel correspond to the following set of underlying-cycle constraints, from Equation (4.6):

$$\begin{aligned}
\text{Constraint } u_1:\quad & p_{0,2} + p_{2,3} + p_{3,4} - p_{1,4} - p_{0,1} = -2 \\
\text{Constraint } u_2:\quad & p_{0,1} + p_{1,4} + p_{4,6} - p_{5,6} - p_{0,5} = 0
\end{aligned} \qquad (4.8)$$

It is sufficient to introduce one constraint per elementary cycle of the underlying graph, because the constraint of any arbitrary (possibly complex) cycle is simply a linear combination of the constraints of its elementary cycles. For example, the constraint associated with the outer cycle of Figure 4.6, as computed from Equation (4.6) with $\delta = 1$ from Equation (4.7), namely $p_{0,2} + p_{2,3} + p_{3,4} + p_{4,6} - p_{5,6} - p_{0,5} = -2$, can be obtained by simply adding the constraints for underlying cycles $u_1$ and $u_2$. Therefore, satisfying constraints $u_1$ and $u_2$ will necessarily satisfy the constraint associated with the outer cycle.

Although any $p_{i,j}$ values that satisfy these constraints result in a valid stage schedule, we are interested in the solution that minimizes integral MaxLive. In general, the integral part of a virtual register lifetime along a register data flow edge $(i, j)$ is directly proportional to the sum of the minimum and additional skip factors and the dependence distance along that edge. As a result, we can express the integral part of the lifetime from operation $i$ to operation $j$, $\omega_{i,j}$ iterations later, as:

$$int_{i,j} = s_{i,j} + p_{i,j} + \omega_{i,j} \qquad (4.9)$$

The integral part of the lifetime associated with $vr_i$ corresponds to the maximum integral part among all the outgoing register data flow edges of operation $i$:

$$int_i = \max_{\forall j \text{ s.t. } (i,j) \in E_{reg}} (\omega_{i,j} + s_{i,j} + p_{i,j}) \qquad (4.10)$$

Using Function (4.10), we compute the integral MaxLive of the kernel of Example 4.4 as the sum of the lifetimes over all virtual registers:

$$\max(p_{0,1} + 1,\ p_{0,2},\ p_{0,5}) + p_{1,4} + p_{2,3} + p_{3,4} + p_{4,6} + p_{5,6} + p_{6,7} + 5 \qquad (4.11)$$

We can now reduce the problem of finding a stage schedule that results in the minimum integral MaxLive to a well-known class of problems, solved by a linear programming (LP) solver [47]. Note, however, that (4.11) is not acceptable as the input to an LP-solver, because the objective function cannot contain any max functions. However, since we are minimizing the objective function, we can remove the max function by using some additional inequalities, called max constraints:

$$p_{i,j} + s_{i,j} + \omega_{i,j} \leq b_i \qquad \forall (i,j) \in E_{reg} \qquad (4.12)$$

Finally, we can remove the constant term in the objective function. The LP-solver input for the kernel and MRT presented in Figure 4.6 is therefore:

$$\begin{aligned}
\text{Minimize:}\quad & b_0 + b_1 + b_2 + b_3 + b_4 + b_5 + b_6 \\
\text{Constraint } u_1:\quad & p_{0,2} + p_{2,3} + p_{3,4} - p_{0,1} - p_{1,4} = -2 \\
\text{Constraint } u_2:\quad & p_{0,1} + p_{1,4} + p_{4,6} - p_{0,5} - p_{5,6} = 0 \\
\text{Max constraints:}\quad & p_{0,1} + 1 \leq b_0;\quad p_{0,2} \leq b_0;\quad p_{0,5} \leq b_0 \\
& p_{1,4} \leq b_1;\quad p_{2,3} \leq b_2;\quad p_{3,4} \leq b_3 \\
& p_{4,6} \leq b_4;\quad p_{5,6} \leq b_5;\quad p_{6,7} \leq b_6
\end{aligned} \qquad (4.13)$$

The result of the LP-solver is shown in Figure 4.6c, i.e. $p_{0,1} = p_{0,5} = 2$ and all other $p_{i,j} = 0$. We can verify that this solution satisfies the constraints of Problem (4.13) and yields $b_0 = 3$. Since, except for $b_0$, all the $b_i$ in the objective function are 0, this solution has integral MaxLive $= b_0 + 5 = 8$ registers. The corresponding stage schedule is shown in Figure 4.6b.
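To make the reduction concrete, Problem (4.13) is small enough to hand to any LP-solver directly. The sketch below is an illustration only, using SciPy's linprog rather than the solver of [47]; the variables are ordered $p_{0,1}, p_{0,2}, p_{0,5}, p_{1,4}, p_{2,3}, p_{3,4}, p_{4,6}, p_{5,6}, p_{6,7}, b_0, \ldots, b_6$:

```python
from scipy.optimize import linprog

n_p, n_b = 9, 7
c = [0.0] * n_p + [1.0] * n_b                    # minimize b0 + ... + b6

A_eq = [
    [-1, 1, 0, -1, 1, 1, 0, 0, 0] + [0] * n_b,   # cycle constraint u1
    [ 1, 0, -1, 1, 0, 0, 1, -1, 0] + [0] * n_b,  # cycle constraint u2
]
b_eq = [-2, 0]

# Max constraints p + const <= b, rewritten as p - b <= -const.
pairs = [(0, 0, 1), (1, 0, 0), (2, 0, 0), (3, 1, 0), (4, 2, 0),
         (5, 3, 0), (6, 4, 0), (7, 5, 0), (8, 6, 0)]  # (p index, b index, const)
A_ub, b_ub = [], []
for p, b, const in pairs:
    row = [0.0] * (n_p + n_b)
    row[p], row[n_p + b] = 1.0, -1.0
    A_ub.append(row)
    b_ub.append(-float(const))

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (n_p + n_b))
print(res.fun + 5)   # objective plus the constant 5: integral MaxLive = 8.0
```

One optimal solution sets $p_{0,1} = p_{0,5} = 2$ and $b_0 = 3$, reproducing the schedule above; consistent with Theorem 4.1 below, the LP optimum here is integer valued.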

Algorithm 4.1 (MinBuff Stage Scheduler) A schedule with the minimum integral MaxLive for a kernel with a general dependence graph, MRT, and set of functional unit latencies is found as follows:

1. Compute all $s_{i,j}$ using Equation (4.4) and search for all elementary cycles in the underlying graph of the dependence graph $G\{V, E_{sched} \cup E_{reg}\}$ [74].

2. If the underlying dependence graph is acyclic, the solution that produces the minimum integral MaxLive is obtained by setting the values of all $p_{i,j}$ to zero.

3. Otherwise, build an underlying-cycle constraint for each elementary cycle of the underlying dependence graph using Equations (4.6) and (4.7). Add the max constraints using Inequality (4.12) and derive the integral MaxLive objective function by summing the $b_i$ variables over $i$.

4. Solve the system of constraints to minimize the integral MaxLive by using an LP-solver. The solution defines the $p_{i,j}$ values that result in the minimum integral MaxLive. Note that the solver may fail to find any solution because of some critical recurrence cycles in the dependence graph. In this case, the given MRT has no solution and a new MRT-schedule must be provided, possibly with a different $II$.

We prove the correctness of this algorithm with two theorems. The first theorem validates the correctness of the solution for general dependence graphs, and the second theorem validates the solution and the linear solution time for the case of an acyclic underlying dependence graph.

Theorem 4.1 The solution of the minimum integral MaxLive LP-problem as defined in Algorithm 4.1 results in the minimum integral MaxLive for a kernel with a general dependence graph, MRT, and set of latencies. The solution is integer valued if an LP-solver finds an optimal solution.

Proof. We first demonstrate that a stage schedule is valid if and only if it satisfies the nonnegative $p_{i,j}$ constraints and the underlying-cycle constraints. Since each $s_{i,j}$ is defined in Equation (4.4) to be the smallest integer value that minimally satisfies scheduling constraint (4.3), this scheduling constraint is satisfied if and only if $p_{i,j} \geq 0$. Thus, enforcing nonnegative $p_{i,j}$ values for each $(i, j) \in E_{sched}$ satisfies each scheduling constraint individually. In addition to enforcing the scheduling dependences, a valid schedule must also enforce the assignment constraints, which ensure that each operation is assigned to a unique time (see Section 5.3.1 for example). Because we represent a schedule in our formulation by the time differences along the edges of the dependence graph, enforcing the assignment constraints is equivalent to ensuring that every path leading to a vertex results in precisely the same time. If there are two distinct paths leading to a vertex, the union of the edges in these two paths must create one or more underlying cycles. Because the underlying-cycle constraints (4.6) precisely enforce that the cumulative distance along the closed path of each underlying cycle is 0, two distinct paths along an underlying cycle result in the same time difference between their two end points. Thus, enforcing the assignment constraints is equivalent to enforcing the underlying-cycle constraints. Note that satisfying the underlying-cycle constraints only for the elementary underlying cycles is sufficient, since any nonelementary cycle can be expressed as a linear combination of several elementary cycles. Consequently, a stage schedule is valid if and only if it satisfies the $p_{i,j} \geq 0$ constraints and the underlying-cycle constraints.

Moreover, since the underlying-cycle constraints, the max constraints, and the objective function (minimize integral MaxLive) are all linear, we can use an LP-solver to find a schedule that minimizes integral MaxLive. Furthermore, since the solution (namely the set of $p_{i,j}$ values) must be (nonnegative) integer valued, we must show that if an LP-solver finds an optimal solution, the $p_{i,j}$ in that solution will be integer valued. This part of the proof is provided in Appendix A. $\Box$

Theorem 4.2 For a dependence graph with an acyclic underlying dependence graph, the minimum integral MaxLive for a given kernel, MRT, and set of latencies is found in a time that is linear in the number of edges in the dependence graph. The minimum integral MaxLive is found by simply setting all $p_{i,j}$ values to zero.


Proof. Consider applying the general form of Algorithm 4.1 (Steps 1, 3, and 4) to a dependence graph with an acyclic underlying dependence graph. There are no underlying cycles, and hence no underlying-cycle constraints, in the linear programming problem formulation, e.g. see (4.13). The max constraints all have the form $p_{i,j} \leq b_i + c_{i,j}$, where $b_i$ and $p_{i,j}$ are to be found and $c_{i,j}$ is a constant. The objective is to minimize a linear combination of the $b_i$, all of which have nonnegative coefficients. Consequently the solution cost is monotonically nondecreasing in the $b_i$. Setting all $p_{i,j} = 0$, their minimum possible value, also allows the $b_i$ to take their minimum feasible values and minimizes the objective function. Therefore, Steps 3 and 4 take no time and Step 2 of Algorithm 4.1 can be used instead. The solution time for Algorithm 4.1 is thus reduced to the time for Step 1. The $s_{i,j}$ are each computed in constant time using Equation (4.4). Since there is one $s_{i,j}$ parameter per edge in the dependence graph, the solution time is linear in the number of edges. $\Box$

A corollary of Theorem 4.2 states that setting the $p_{i,j}$ to 0 for each edge in an acyclic part of an underlying dependence graph results in the minimum integral part for those edges; furthermore, increasing the $p_{i,j}$ of any of these edges cannot decrease integral MaxLive. This corollary can be shown by applying the proof of Theorem 4.2 to the acyclic parts.

We conclude this section by contrasting the complexity of our stage scheduling model, summarized in Figure 4.7, with the modulo scheduling model presented by Govindarajan et al [44], where both models minimize integral MaxLive for a machine with finite resources, but we assume a given MRT and Govindarajan et al do not. Note that since their algorithm produces an MRT as well, it considers a larger scheduling space and may thus result in schedules with smaller integral MaxLive. We address algorithms that explore the entire modulo scheduling space in Chapter 5. The most important difference between these two algorithms that minimize integral MaxLive is that the stage scheduling model can be directly solved with an LP-solver instead of an integer linear programming solver, which is usually implemented by solving an exponential number of LP-problems.

Minimize:
$$\mathit{Buff} = \sum_{i \in V} b_i$$

Subject to:

Underlying cycles:
$$\sum_{(i,j) \in u} sign_{i,j}(u) \, p_{i,j} = -\sum_{(i,j) \in u} sign_{i,j}(u) \, s_{i,j} - \delta_u \qquad \forall u \in U_G$$

Max:
$$b_i - p_{i,j} \geq \omega_{i,j} + s_{i,j} \qquad \forall (i,j) \in E_{reg}$$

Lower bounds:
$$b_i \geq 0, \text{ integer} \qquad \forall i \in V$$
$$p_{i,j} \geq 0, \text{ integer} \qquad \forall (i,j) \in E_{sched} \cup E_{reg}$$

Definitions:
$$rdist_{i,j} = (row_j - row_i) \bmod II$$
$$\delta_u = \frac{1}{II} \sum_{(i,j) \in u} sign_{i,j}(u) \, rdist_{i,j}$$
$$s_{i,j} = \left\lceil \frac{l_{i,j} - rdist_{i,j}}{II} \right\rceil - \omega_{i,j}$$

Figure 4.7: Stage scheduling model for minimum integral MaxLive (minimum Buffers).

108

The stage scheduling model also has a vastly smaller number of variables and constraints. For example, all the resource constraints are resolved prior to stage scheduling during the MRT-scheduling phase, and are therefore absent from the stage scheduling model. Also, most of the dependence constraints present in the modulo scheduling model disappear in the stage scheduling model, as most of them are resolved statically by selecting appropriate minimum skip factors ($s_{i,j}$). The only remaining dependence constraints are the underlying cycle constraints. Similarly, most of the constraints required to compute the register requirements associated with a solution are also computed statically in the stage scheduling model. The only $p_{i,j}$ that actually must be accounted for are the ones where operation i is in an underlying cycle; the corollary of Theorem 4.2 ensures that all other $p_{i,j}$ will be 0 in an optimum stage schedule. As a result, the stage scheduling model results in solution times that are significantly smaller than those of the modulo scheduling model, and thus the stage scheduler is an interesting tool for analyzing large loops, even though it considers only the given MRT.
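To make the structure of Figure 4.7 concrete, the following sketch builds and solves the minimum-buffer linear program with an off-the-shelf LP-solver. It is a minimal illustration under stated assumptions, not the dissertation's implementation: the inputs (edge lists, precomputed minimum skip factors $s_{i,j}$, and signed cycle descriptions with their constants $\Delta_u$) are assumed data structures, and scipy's linprog stands in for the LP-solver referred to in the text; by Theorem 4.1 an optimal basic solution is integer valued even though the relaxation is solved.

from scipy.optimize import linprog
import numpy as np

def min_buffers_lp(V, sched_edges, reg_edges, omega, s, cycles):
    """V: list of operations; sched_edges/reg_edges: (i, j) pairs;
    omega[(i,j)], s[(i,j)]: iteration distances and minimum skip factors;
    cycles: list of (signed_edges, delta_u) per elementary underlying cycle,
    where signed_edges is a list of ((i, j), +1 or -1) pairs."""
    edges = list(dict.fromkeys(sched_edges + reg_edges))
    nv, ne = len(V), len(edges)
    bidx = {v: k for k, v in enumerate(V)}          # buffer variables b_i
    pidx = {e: nv + k for k, e in enumerate(edges)} # skip-factor variables p_ij
    c = np.zeros(nv + ne)
    c[:nv] = 1.0                                    # minimize sum of the b_i
    # Max constraints: b_i - p_ij >= omega + s, rewritten as p_ij - b_i <= -(omega + s)
    A_ub, b_ub = [], []
    for (i, j) in reg_edges:
        row = np.zeros(nv + ne)
        row[pidx[(i, j)]] = 1.0
        row[bidx[i]] = -1.0
        A_ub.append(row)
        b_ub.append(-(omega[(i, j)] + s[(i, j)]))
    # Underlying-cycle constraints: sum sign*p = -sum sign*s - delta_u
    A_eq, b_eq = [], []
    for signed_edges, delta_u in cycles:
        row = np.zeros(nv + ne)
        rhs = -delta_u
        for e, sign in signed_edges:
            row[pidx[e]] += sign
            rhs -= sign * s[e]
        A_eq.append(row)
        b_eq.append(rhs)
    return linprog(c, A_ub=np.array(A_ub) if A_ub else None,
                   b_ub=b_ub or None,
                   A_eq=np.array(A_eq) if A_eq else None,
                   b_eq=b_eq or None,
                   bounds=[(0, None)] * (nv + ne))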

4.3.2 Scheduling for Minimum MaxLive

In this section, we extend the stage scheduling algorithm of the previous section to consider the fractional as well as the integral parts of the register requirements. We first quantify the fractional MaxLive associated with a kernel, an MRT, and a stage schedule. Then we investigate the interaction between the fractional MaxLive and the search for an optimal stage schedule. We conclude the section by presenting an algorithm that minimizes the total MaxLive among all stage schedules for a given MRT.

Consider a register edge (i, j) between operations i and j scheduled in $row_i$ and $row_j$, respectively. We characterize the fractional part of the lifetime associated with register edge (i, j) in row r by $frac_{r,i,j}$, where $frac_{r,i,j}$ is 1 if the fractional part of the lifetime spans row r, and 0 otherwise. In general, $frac_{r,i,j}$ is 1 if row r is encountered in the MRT when going forward (with wraparound) from $row_i$ to $row_j$, i.e. from the row of def-operation i to the row of use-operation j, inclusively.

This definition implies that $frac_{r,i,j}$ is 1 if the modulo distance from $row_i$ to r is no larger than the modulo distance from $row_i$ to $row_j$, i.e. if $dist(row_i, r) \le rdist_{i,j}$, using the definition of the dist function and the $rdist_{i,j}$ terms shown in Equation (4.1). Using the following generic indicator function $\mathcal{X}$:

$$\mathcal{X}(condition) = \begin{cases} 1 & \text{if condition is true} \\ 0 & \text{otherwise} \end{cases} \quad (4.14)$$

we may express the fractional part of the lifetime generated by register edge (i, j) in row r as:

$$frac_{r,i,j} = \mathcal{X}(dist(row_i, r) \le rdist_{i,j}) \quad (4.15)$$

Note that Equation (4.15) is based on our virtual register model, which assumes that a virtual register is reserved when its def-operation is scheduled and freed one cycle after its last use-operation is scheduled. Since, in general, the register requirements of a virtual register are determined only by its longest lifetime, i.e. by the edge from its def-operation to its last use-operation, we introduce the following two definitions:

$last_i$: the operation that last uses $vr_i$.
$ldist_i$: the distance from the row of operation i to the row of operation $last_i$, possibly in the next instance of the MRT.

Using our formulation, $last_i$ and $ldist_i$ are determined as follows:

$$last_i = k \text{ s.t. } (i,k) \in E_{reg} \text{ and } rdist_{i,k} + (\omega_{i,k} + s_{i,k} + p_{i,k}) \cdot II = \max_{\forall j \text{ s.t. } (i,j) \in E_{reg}} \left[ rdist_{i,j} + (\omega_{i,j} + s_{i,j} + p_{i,j}) \cdot II \right] \quad (4.16)$$

$$ldist_i = dist(row_i, row_{last_i}) \quad (4.17)$$

The sum of the fractional parts in row r, referred to as $frac_r$, is then obtained by summing the fractional part of each operation, i, to its last use-operation, $last_i$, namely:

$$frac_r = \sum_{(i, last_i) \in E_{reg}} \mathcal{X}(dist(row_i, r) \le ldist_i) \quad (4.18)$$

where $\mathcal{X}(dist(row_i, r) \le ldist_i)$ corresponds to $frac_{r,i,last_i}$. Because fractional MaxLive is determined by the row with the most fractional parts, it is computed as follows:

$$\text{fractional MaxLive} = \max_{r=0,\ldots,II-1} frac_r \quad (4.19)$$
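A small sketch of Equations (4.16) through (4.19) under assumed data structures (per-vr use lists, MRT rows, and the $\omega$, s, and p edge annotations): it finds each virtual register's last use-operation, its ldist, and the resulting fractional MaxLive.

def fractional_maxlive(II, row, reg_edges, omega, s, p):
    """row[i]: MRT row of operation i; reg_edges[i]: use-operations of vr_i;
    omega, s, p: dictionaries keyed by (def, use) register edges."""
    def dist(a, b):                      # forward modulo distance in the MRT
        return (b - a) % II
    frac = [0] * II
    for i, uses in reg_edges.items():
        # Equation (4.16): the last use maximizes rdist + (omega + s + p) * II.
        last = max(uses, key=lambda j: dist(row[i], row[j])
                   + (omega[(i, j)] + s[(i, j)] + p[(i, j)]) * II)
        ldist = dist(row[i], row[last])  # Equation (4.17)
        for r in range(II):              # Equation (4.18): per-row counts
            if dist(row[i], r) <= ldist:
                frac[r] += 1
    return max(frac)                     # Equation (4.19): fractional MaxLive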

Total MaxLive is then simply the sum of integral and fractional MaxLive, namely:

$$R_{MRT} = \sum_{(i, last_i) \in E_{reg}} (\omega_{i,last_i} + s_{i,last_i} + p_{i,last_i}) + \max_{r=0,\ldots,II-1} \sum_{(i, last_i) \in E_{reg}} \mathcal{X}(dist(row_i, r) \le ldist_i) \quad (4.20)$$

where the first and second terms correspond, respectively, to the integral and fractional MaxLive associated with the given MRT and stage schedule.

We can now formalize the interaction between the fractional and the integral parts of the register requirements as follows. When searching for a stage schedule with minimum integral MaxLive, the solver determines the $p_{i,j}$ values, which in turn determine fractional MaxLive, as indicated by Equations (4.16), (4.17), (4.18), and (4.19). Since fractional MaxLive is not taken into account when minimizing integral MaxLive, it is possible that there is another minimum integral MaxLive stage-schedule with lower fractional MaxLive, or even another stage-schedule with higher integral MaxLive but with lower fractional and total MaxLive. Consider, for example, a minimum integral MaxLive stage-schedule that does not result in minimum fractional MaxLive. Since the schedule does not achieve minimum fractional MaxLive, we know that there must be at least one virtual register, $vr_i$, for which another use-operation, x, would lower fractional MaxLive if it were the last use-operation.

By introducing additional constraints into the linear program, we may artificially force operation x to be the last use-operation of $vr_i$, i.e. we can force $last_i = x$. By solving this augmented problem, i.e. by searching for a stage schedule with minimum integral MaxLive among the subset of all stage schedules that have $last_i = x$, we may find out whether fractional MaxLive can be decreased without increasing integral MaxLive. Note that, in general, there may be one or more use-operations of one or more virtual registers that can decrease fractional MaxLive. In these cases, we may have to investigate combinations of forced last use-operations that can possibly achieve a lower MaxLive. Note also that even though fractional MaxLive can be reduced by at most one by altering the contribution from a single virtual register, it is possible to achieve lower total MaxLive values even if integral MaxLive is increased, because the increase in integral MaxLive may be smaller than the decrease in fractional MaxLive caused by changing the last use-operations of several virtual registers.

We may now formulate our general approach to minimizing total MaxLive. First, we search for a minimum integral MaxLive stage-schedule among all stage schedules that share a given MRT, using Algorithm 4.1. The resulting schedule, referred to as the base solution, achieves minimum integral MaxLive but may not, in general, result in minimum fractional MaxLive. The second step is to identify the combinations of $last_i$ operations that decrease fractional MaxLive below that of the base solution. In the remainder of this section, we refer to such combinations of $last_i$ operations as forcing combinations, and to the $last_i$ operations as their forcing candidates. We assume in this section that every possible combination is investigated; in Section 4.3.3, however, we will present inclusion properties that reduce the number of forcing combinations that must be investigated to guarantee an optimal solution. Finally, we investigate each forcing combination in turn, by augmenting the linear program that was used to compute the base solution with the additional constraints associated with its forcing candidates, and applying Algorithm 4.1 to the augmented problem formulation.

The additional constraints, referred to as forcing constraints, are defined as follows. Consider a forcing candidate $last_i = x$ that may result in a reduced fractional MaxLive if operation x is scheduled as the last use-operation of $vr_i$.

We could simply implement these constraints by forcing every other use-operation of $vr_i$ to be scheduled no later than operation x. While these constraints would clearly work, they may be overly constraining, because some of these other use-operations may not increase the fractional parts beyond that of operation x, even if they were scheduled last. Thus, we can formulate forcing constraints that consider only the use-operations that effectively increase the fractional part of a virtual register lifetime beyond that of operation x, i.e. that force each operation z such that $(i, z) \in E_{reg}$ with $rdist_{i,z} > rdist_{i,x}$ not to be the last use-operation of virtual register i. We may thus formulate the forcing constraints associated with the forcing candidate $last_i = x$ as:

$$\omega_{i,z} + s_{i,z} + p_{i,z} < \omega_{i,x} + s_{i,x} + p_{i,x} \qquad \forall z \text{ s.t. } (i,z) \in E_{reg} \text{ and } rdist_{i,z} > rdist_{i,x} \quad (4.21)$$
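The quantifier in Inequality (4.21) translates directly into a small loop over the uses of $vr_i$. In the hedged sketch below, the constraint itself is recorded through a caller-supplied add_constraint callback, a hypothetical helper, since the text later notes (in the proof of Theorem 4.3) that forcing constraints can be represented as additional scheduling edges in the dependence graph.

def add_forcing_constraints(i, x, uses, rdist, add_constraint):
    """Force operation x to be the last use-operation of vr_i.
    add_constraint(z, x) is assumed to record the requirement
    omega[i,z] + s[i,z] + p[i,z] < omega[i,x] + s[i,x] + p[i,x],
    e.g. as an additional scheduling edge in the dependence graph."""
    for z in uses:
        # Only uses whose fractional part extends beyond that of x matter.
        if z != x and rdist[(i, z)] > rdist[(i, x)]:
            add_constraint(z, x)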

To illustrate this process, consider Example 4.4 again. As shown in Figures 4.8a and 4.8b, three operations use vr0: add1, mult2, and add5, scheduled in rows 0, 1, and 2, respectively. The fractional parts associated with Example 4.4 are shown in Figure 4.8c, with three different cases for vr0 depending on which of the three operations is the last use of vr0. As indicated before, the first step is to search for a schedule with minimum integral MaxLive; in this case, we initially obtained a schedule (referred to as the base solution) with an integral MaxLive of 8 and a fractional MaxLive of 6, because add1 was scheduled II cycles earlier than shown in Figure 4.4b and thus add5 was the last use-operation of vr0. The second step is to find the forcing combinations, if any, that decrease the fractional MaxLive of the base solution. Since there is only one virtual register with multiple use-operations (3 use-operations in this example), there are precisely 3 possible last use-operations. Of these three, since the fractional MaxLive of 6 can only be reduced if add1 is forced to be the last use-operation of vr0, only $last_0 = add1$ results in a lower fractional MaxLive than that achieved by the base solution. To investigate the stage schedules that result from forcing $last_0$ to be the candidate add1, we introduce the forcing constraints produced by Inequality (4.21).

[Figure 4.8 panels: a) original dependence graph; b) MRT (II = 3); c) fractional parts, for the three cases in which add1, mult2, or add5 is the last use of vr0; d) dependence graph with additional (bold) constraint edges.]

Figure 4.8: Fractional part of the lifetimes (Example 4.4).

Since the two other use-operations of vr0 are both associated with larger fractional parts, we must introduce two forcing constraints, one for the mult2 operation and one for the add5 operation. These constraints are depicted in the dependence graph of Figure 4.8d by the additional (bold) edges. Solving the augmented problem formulation for a stage schedule with minimum integral MaxLive also results in a schedule with an integral MaxLive of 8 in this example, thus achieving a lower fractional MaxLive without increasing integral MaxLive. The resulting stage schedule is the one shown in Figure 4.6b.


Algorithm 4.2 (MinReg Stage-Scheduler) A schedule with the minimum MaxLive for a kernel with a general dependence graph, MRT, and set of functional unit latencies is found as follows:

1. Use Algorithm 4.1 to compute the minimum integral MaxLive. We refer to the resulting stage schedule as the base solution. If Algorithm 4.1 fails to produce a stage schedule, which may happen because of some critical recurrence cycles in the dependence graph, a new MRT-schedule must be provided, possibly with a different II.

2. If the underlying dependence graph is acyclic, the base solution results in the minimum MaxLive.

3. Otherwise, compute the total MaxLive associated with the base solution using Equation (4.20) and determine the forcing combinations that decrease the fractional MaxLive of the base solution. While all possible forcing combinations can be exhaustively computed and used, Algorithm 4.3 in Section 4.3.3 indicates how to use inclusion properties to reduce the number of forcing combinations that need to be investigated.

4. Use Algorithm 4.1 repeatedly to compute the minimum integral MaxLive for the base system augmented by the forcing constraints associated with each forcing candidate of the current forcing combination. Forcing constraints are generated using Inequality (4.21). By processing each forcing combination in order of increasing fractional MaxLive, the first solution that achieves the integral MaxLive obtained by the base solution is optimal. If no such solution is found, then continue evaluating forcing combinations until the decrease in fractional MaxLive below the base solution for all remaining unevaluated combinations is no greater than the reduction in total MaxLive obtained by the best solution yet found. Then this best solution found is optimal. (A high-level sketch of this search procedure follows.)
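The sketch below follows the steps of Algorithm 4.2 at a high level. Every name in it is a hypothetical wrapper, not an API from the dissertation: solve_min_buffers stands in for Algorithm 4.1 (e.g. the LP of Figure 4.7), and forcing_combinations_by_frac stands in for Algorithm 4.3, yielding combinations in order of increasing fractional MaxLive.

def minreg_stage_schedule(problem):
    base = solve_min_buffers(problem)              # Step 1: base solution
    if base is None:
        raise RuntimeError("no stage schedule; a new MRT, possibly "
                           "with a different II, is needed")
    if problem.underlying_graph_is_acyclic():
        return base                                # Step 2: base is optimal
    best = base                                    # Steps 3 and 4
    for combo in forcing_combinations_by_frac(problem, base):
        # Remaining combinations cannot gain more fractional MaxLive than
        # the total MaxLive reduction already achieved by the best solution.
        if base.frac_maxlive - combo.frac_maxlive <= \
           base.total_maxlive - best.total_maxlive:
            break
        aug = problem.with_forcing_constraints(combo)   # Inequality (4.21)
        sol = solve_min_buffers(aug)
        if sol is not None and sol.total_maxlive < best.total_maxlive:
            best = sol
            if sol.integral_maxlive == base.integral_maxlive:
                break                              # first such solution is optimal
    return best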


We prove the correctness of this algorithm with two theorems. The first theorem validates the correctness of the solution for general dependence graphs, and the second for acyclic underlying graphs.

Theorem 4.3 Algorithm 4.2 produces a solution with the minimum MaxLive for a kernel with a general dependence graph, MRT, and set of latencies. Furthermore, any solution found will be integer valued, if an LP-solver finds an optimal solution.

Proof. Since we test all the forcing combinations that may reduce the fractional MaxLive, and since Algorithm 4.1 produces the minimum integral MaxLive for each of the problems augmented by the forcing conditions, the solutions of Algorithm 4.2 and the original unaugmented solution must include the minimum achievable total MaxLive for this MRT. By ordering the forcing combinations in increasing order of fractional MaxLive, we may stop the search as soon as a solution with reduced fractional MaxLive and the minimum integral MaxLive obtained by the base solution is found. This solution is optimal since we first checked all forcing combinations with lower fractional MaxLive and since we cannot, by definition, reduce the integral MaxLive below that of the base solution. If no such solution is found, the final stopping criterion in Step 4 of the algorithm stops considering new combinations when it is impossible for any of them to result in a better solution than the best solution found so far. Moreover, the additional forcing constraints defined by Inequality (4.21) can be represented in the original dependence graph by additional scheduling edges. Therefore, from Theorem 4.1, any solution produced by an LP-solver for each use of Algorithm 4.1 is guaranteed to be integer valued, if an LP-solver finds an optimal solution. □

While there is potentially an exponential number of forcing combinations to investigate, this number tends in practice to be relatively small, for several reasons. First, the number of forcing combinations depends on the number of operations lying in underlying cycles, which is usually only a fraction of the total number of operations. Second, we have to consider only those forcing combinations that potentially reduce the fractional MaxLive. For example, we do not need to consider forcing combinations unless they may impact the value of the max function in Equation (4.20). Finally, we can use some inclusion properties to reduce the total number of forcing combinations investigated. More details are provided in Section 4.3.3.

Theorem 4.4 For dependence graphs with acyclic underlying graphs, the solution that minimizes integral MaxLive also minimizes the total MaxLive for a given kernel, MRT, and set of latencies.

Proof. In general, the solution that minimizes integral MaxLive may not be a solution with the minimum fractional MaxLive. Recall, however, that the Algorithm 4.1 solutions for acyclic underlying graphs have all $p_{i,j} = 0$. Therefore, reducing a fractional lifetime for some $vr_i$ requires setting at least one $p_{i,j}$ to 1 or more (i.e. setting $p_{i,j}$ to at least the value that minimally forces operation j to be the last use-operation of $vr_i$, as computed in Equation (4.16)), thereby increasing the integral lifetime of $vr_i$ by at least 1. Therefore, reducing the fractional register requirement of $vr_i$ cannot decrease the MaxLive of this virtual register. As the lifetime of each vr is independent of the lifetime of any other vr for an acyclic underlying graph, it is never advantageous to force a different last use-operation unless the total lifetime of the associated vr is decreased. Thus, we cannot reduce the register requirements of the base solution if the underlying graph is acyclic. □

A corollary of Theorem 4.4 states that Theorem 4.4 can also be applied to acyclic parts of the underlying dependence graph (i.e. to a multiuse virtual register lifetime none of whose def-use edges are in an underlying cycle), because in acyclic parts of the underlying graph, increasing such a lifetime cannot decrease MaxLive, as shown in the proof of Theorem 4.4. This result is in contrast to the cyclic part of the underlying graph, where Theorem 4.4 does not apply, since increasing the length of one lifetime may decrease the length of another, and thus may possibly decrease MaxLive.


4.3.3 Determining Forcing Combinations

In the previous section, we reduced the value of fractional MaxLive associated with a stage schedule by forcing the last use-operations of multiuse variables with at least one def-use edge in an underlying cycle. In this section, we proceed in the reverse fashion: we determine a target value for fractional MaxLive and investigate the resulting forcing candidates and forcing combinations that achieve this target fractional MaxLive.

[Figure 4.9 panels: a) MRT (II = 3); b) fractional parts of the lifetimes of vr0 through vr6, distinguishing the minimum fractional parts (black bars, up to the first-encountered use) from the potential fractional parts (grey bars, up to the last-encountered use).]

Figure 4.9: Fractional parts of the lifetimes (Example 4.5).

Since the last use-operations are unknown in this section, we distinguish two components of the fractional parts associated with a virtual register. The first component corresponds to the minimum fractional parts and spans (inclusively) from the row of the def-operation to the row of the use-operation that is first encountered in the MRT when going forward, with wraparound. This component is illustrated by the black bars in Figure 4.9b. By counting the black bars in each row, we can obtain a lower bound on the fractional MaxLive, 5 in this example. The second component corresponds to the potential fractional parts and spans (exclusively) from the row of the use-operation that is first encountered to the row of the use-operation that is last encountered in the MRT when going forward, with wraparound. This component is illustrated by the grey bars in Figure 4.9b. Depending on which operation is scheduled last, these grey bars may affect the overall fractional MaxLive. In this example, fractional MaxLive increases from 5 to 6 if either of the operations mult2 or add5 is the last use of vr0.

Therefore, forcing operation add1 to be the last use ensures a fractional MaxLive of 5.

Using Equation (4.18), we describe the number of fractional live registers in each row of the MRT as a system of equations, in which the minimum fractional parts are reduced to a constant term. We obtain for Example 4.5 the following $frac_r$ equations, one for each of the MRT rows:

$$frac_0 = 5$$
$$frac_1 = 5 + \mathcal{X}(1 \le ldist_0)$$
$$frac_2 = 4 + \mathcal{X}(2 \le ldist_0) \quad (4.22)$$

which results from the fact that there are 5 minimum fractional parts in row 0, 5 minimum fractional parts and one potential fractional part in row 1, and 4 minimum fractional parts and one potential part in row 2. By looking at this system of equations, we can deduce that the value of fractional MaxLive is 6 unless $\mathcal{X}(1 \le ldist_0) = 0$, in which case fractional MaxLive is 5. Using the definition of $\mathcal{X}$ in Equation (4.14), forcing $\mathcal{X}(1 \le ldist_0)$ to 0 corresponds to forcing $1 \le ldist_0$ to be false, namely:

$$1 > ldist_0 = dist(row_0, row_{last_0})$$

Thus, forcing $\mathcal{X}(1 \le ldist_0)$ to 0 constrains the last use of vr0 to be in row 0 since, otherwise, $ldist_0$ is strictly larger than 0. As a result, we have just found one forcing candidate, namely $last_0 = add1$, since add1 is a use-operation of vr0 that is assigned to row 0; therefore, forcing add1 to be the last use results in a fractional MaxLive of 5.

From the above description, we can see that the fewer the equations in the system, the fewer the forcing combinations that have to be investigated by an LP-solver. Therefore, simplifying the system of equations is crucial to obtaining an efficient algorithm. We list here some rules that can be used to prune the system of equations without compromising the optimality of the solution. All rules will use a stylized version of Equation (4.22):

$$frac_r = c_r + X_1 + \cdots + X_{n_r} \quad (4.23)$$

where the constant $c_r$ represents the number of minimum fractional parts in row r and the X terms represent the $n_r$ distinct conditions for the potential fractional parts in row r. If only solutions with fractional MaxLive $\le T$ are interesting, then the following set of constraints applies:

$$frac_r = c_r + X_1 + \cdots + X_{n_r} \le T \qquad \forall r \in [0, II) \quad (4.24)$$

However, those constraints that are trivially satisfied or covered may be pruned, as shown by the following rules.

Rule 4.1 (Partial Evaluation of X Terms) Consider a multiuse virtual register $vr_i$ for which none of its def-use edges belongs to an underlying cycle of the dependence graph. In this case, we can statically evaluate each of the $\mathcal{X}(dist(row_i, r) \le ldist_i)$ terms associated with $vr_i$, because the corollary of Theorem 4.4 indicates that reducing the fractional part of a multiuse virtual register lifetime none of whose def-use edges are in an underlying cycle necessarily increases its integral part. Statically evaluated X terms can be eliminated from all $frac_r$ expressions, with a corresponding adjustment in $c_r$.

Rule 4.2 (Noncritical Row) Consider an equation $frac_x$ and a target fractional MaxLive T:

$$frac_x = c_x + X_1 + \cdots + X_{n_x} \le T$$

Equation $frac_x$ can be safely eliminated if $c_x + n_x \le T$ holds.


Proof. Even in the worst-case scenario, where all of $X_1, \ldots, X_{n_x}$ evaluate to 1, the number of fractional parts in row x is still no larger than the target fractional MaxLive, T. Thus this constraint is trivially satisfied. □

In our example, with a target fractional MaxLive of 5, this rule results in the elimination of equations $frac_0$ and $frac_2$ in Equation (4.22).

Rule 4.3 (Included Row) Consider two equations $frac_x$ and $frac_y$ where the X terms of equation $frac_x$ are a subset of the X terms in equation $frac_y$, namely:

$$frac_x = c_x + X_1 + \cdots + X_{n_x}$$
$$frac_y = c_y + X_1 + \cdots + X_{n_x} + X_{n_x+1} + \cdots + X_{n_y}$$

Equation $frac_y$ can be safely removed if $c_y + n_y \le c_x + n_x$ holds, since equation $frac_x$ is more constraining.

Proof. We demonstrate here that the number of fractional parts of the lifetimes in row y is guaranteed to be no larger than the number in row x, and that enforcing equation $frac_x$ is therefore sufficient. Assuming the worst-case scenario, where all of the $n_y - n_x$ terms $X_{n_x+1} + \cdots + X_{n_y}$ evaluate to 1, we can write:

$$frac_y \le c_y + X_1 + \cdots + X_{n_x} + (n_y - n_x)$$

Using the condition $c_y + n_y \le c_x + n_x$, we can in turn bound the right-hand side of the previous inequality as follows:

$$frac_y \le c_y + X_1 + \cdots + X_{n_x} + (n_y - n_x) \le c_x + X_1 + \cdots + X_{n_x} = frac_x$$

Therefore, satisfying $frac_x$ necessarily satisfies $frac_y$. □

We can now describe an algorithm that computes all forcing combinations that result in lower fractional MaxLive. This algorithm specifies Step 3 of Algorithm 4.2 in more detail.

Algorithm 4.3 The set of forcing combinations that result in a target fractional MaxLive of T or less is computed as follows:

3.1 Build the $frac_r$ equations for each row of the MRT using Equation (4.18).

3.2 Partially evaluate all possible X terms using Rule 4.1, and then eliminate noncritical and included $frac_r$ equations using Rules 4.2 and 4.3, for the target fractional MaxLive, T.

3.3 Build the set of all forcing combinations that satisfy the target fractional MaxLive as follows. We initialize the current forcing combination to be empty. We pick a $frac_r$ equation and select the minimum number of X terms that must be 0 to satisfy the target fractional MaxLive, T. We record the forcing candidates associated with each of the X terms that are set to 0 in the current forcing combination. We then proceed similarly with each $frac_r$ equation in turn. Eventually, when each $frac_r$ equation has been satisfied, we have obtained the first forcing combination, which we store. We then backtrack: we unselect an X term, remove the forcing candidates corresponding to this unselected X term from the current forcing combination, select a different X term, add the forcing candidates corresponding to the newly selected X term to form another forcing combination, and store it. We proceed in this fashion until all combinations of X terms in all $frac_r$ equations have been exhausted.

Note that Algorithm 4.3 is first used with the minimum possible T; once each forcing combination for a given T has been investigated, we increment T by 1 and repeat until an optimal solution has been found by Algorithm 4.2. (A sketch of the enumeration in Step 3.3 follows.)
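The backtracking enumeration of Step 3.3 can be sketched as a cross product over per-row choices. The representation is assumed: the pruned system arrives as (c_r, terms) pairs, and each X term is a hypothetical object carrying a .candidates set of forcing candidates.

from itertools import combinations, product

def forcing_combinations(equations, T):
    """equations: list of (c_r, terms) pairs surviving Rules 4.1-4.3;
    yields sets of forcing candidates that make every frac_r <= T."""
    per_row = []
    for c_r, terms in equations:
        need = c_r + len(terms) - T      # minimum number of X terms to zero
        if need > 0:
            per_row.append(list(combinations(terms, need)))
    for pick in product(*per_row):       # systematic backtracking over choices
        combo = set()
        for zeroed in pick:
            for term in zeroed:
                combo.update(term.candidates)   # assumed attribute
        yield combo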


4.4 Stage Scheduling Heuristics

We now investigate fast and efficient heuristics for stage scheduling. The heuristics reduce the register requirements of a modulo schedule by reassigning some operations to different stages, i.e. by shifting operations by multiples of II cycles. The input to these heuristics is a valid initial modulo schedule, and the output is a valid modulo schedule with reduced register requirements. First, we formalize the input to the stage scheduling heuristics and present the running example used to illustrate our techniques in Section 4.4.1. In Section 4.4.2, we describe the basic stage scheduling transforms that are used by the stage scheduling heuristics to modify the stage in which operations are scheduled. We then present the four stage scheduling heuristics considered in this chapter, in Sections 4.4.3, 4.4.4, 4.4.5, and 4.4.6. We summarize in Section 4.4.7 by illustrating the effect of these heuristics, and of combinations of them, on our running example.

4.4.1 Input to Stage Scheduling Heuristics

The initial modulo schedule is given by its initiation interval II and the time at which each operation of the first loop iteration is scheduled. Because the stage scheduling heuristics operate on the dependence graph annotated by the additional skip factors (i.e. the $p_{i,j}$ factors), we first illustrate how to derive the $p_{i,j}$ factors associated with the initial modulo schedule. We derive here the initial $p_{i,j}$ associated with edge (i, j), given II and the scheduling times of operations i and j in the initial schedule, $time_i$ and $time_j$, respectively. Recall that the time interval along edge (i, j) is defined in Equation (4.2) as follows:

$$time_j - time_i = rdist_{i,j} + (s_{i,j} + p_{i,j}) \cdot II$$

By isolating $p_{i,j}$ and substituting the right-hand side of Equation (4.4) for $s_{i,j}$, we obtain:

$$p_{i,j} = \frac{time_j - time_i - rdist_{i,j}}{II} - \left\lceil \frac{l_{i,j} - rdist_{i,j}}{II} \right\rceil + \omega_{i,j}$$

Since $-\lceil x \rceil = \lfloor -x \rfloor$, we may transform the above equation to:

$$p_{i,j} = \frac{time_j - time_i - rdist_{i,j}}{II} + \left\lfloor \frac{rdist_{i,j} - l_{i,j}}{II} \right\rfloor + \omega_{i,j}$$

Note that the first term of the right-hand side of this equation is guaranteed to be integer valued, since $rdist_{i,j}$ is defined as $(row_j - row_i) \bmod II$ in Equation (4.1). Thus, the $(time_j - time_i - rdist_{i,j})/II$ term can be safely introduced inside the floor function of the second term. After canceling the $rdist_{i,j}$ terms, we obtain the following equation for the initial $p_{i,j}$ associated with edge (i, j), given the scheduling times of operations i and j in the initial schedule:

$$p_{i,j} = \left\lfloor \frac{time_j - time_i - l_{i,j}}{II} \right\rfloor + \omega_{i,j} \quad (4.25)$$

The input to the stage scheduling heuristics simply corresponds to the dependence graph of the loop annotated by the initial $p_{i,j}$ values, as computed by Equation (4.25).
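Equation (4.25) maps directly onto integer floor division. A minimal sketch, assuming dictionaries for the schedule times, the edge latencies $l_{i,j}$, and the edge distances $\omega_{i,j}$:

def initial_skip_factors(II, time, latency, omega, edges):
    """Recover the initial additional skip factors from a given modulo
    schedule, per Equation (4.25); Python's // is floor division, which
    matches the floor in the equation even for negative numerators."""
    return {(i, j): (time[j] - time[i] - latency[(i, j)]) // II + omega[(i, j)]
            for (i, j) in edges}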

Example 4.5 We will illustrate our techniques using the following kernel as a running example: z[i] = SQRT(x[i]/(1/x[i]+x[i]+a)) + (y[i]+a)*(y[i]+b)*(y[i]+c), computed as presented in Figure 4.10a. This example, although constructed to illustrate our heuristics concisely, is representative of several kernels found in our benchmark suite. A modulo schedule with II = 6 for our target machine is presented in Figure 4.10b. This schedule is a good example of a schedule produced by a register-insensitive modulo scheduler: it results in a schedule with high throughput and provides a good stage schedule along the critical path (add0, load1, div3, add4, div5, sqrt6, add7, store8) so as to minimize the schedule length of the loop in the typical fashion. However, operations not on the critical path tend to be scheduled too early, e.g. operation add9 is scheduled at time 5 when it could be scheduled as late as time 29. This effect is due to the fact that current modulo schedulers favor scheduling operations early to prevent needlessly increasing either II or the schedule length. Stage scheduling will thus make a significant difference unless the number of stages or the number of operations not on the critical path is too small. Trying different MRTs for a given loop iteration is also likely to become increasingly important as II increases.

The dependence graph in Figure 4.10a has three self-loops on the address arithmetic nodes (add0, add9, and add10) and four elementary underlying cycles in the underlying graph $G = \{V, E_{sched} \cup E_{reg}\}$. The dependence graph for Example 4.5 is representative of the ones found in our benchmark suite, as on average more than 50% of the operations of a loop iteration belong to an underlying cycle. The values of the nonzero $p_{i,j}$ are shown as labels on their respective edges. Because we describe a stage schedule using time differences associated with edges instead of absolute times associated with operations, we must ensure that the underlying cycle constraint, defined in Equation (4.6), holds for each elementary underlying cycle. Since we are given a valid modulo schedule, the constant term defined in Equation (4.7) can be quickly computed as the sum of the $p_{i,j}$ from the given modulo schedule, where a $p_{i,j}$ is taken as negative when its edge is traversed in the reverse direction. These constant terms are shown in Figure 4.10a inside the curved dashed arrows.

4.4.2 Stage Scheduling Transforms

The first transform modifies the additional skip factors associated with cut edges, where a cut edge is an edge in the dependence graph that, if removed, cuts the graph into two disconnected parts. Note that an edge is a cut edge if and only if it does not lie in a cycle of the underlying dependence graph. We proved in Theorem 4.2 that the additional skip factors of a dependence graph without underlying cycles can be individually minimized by setting all their values to zero. We apply this theorem here to the parts of the underlying dependence graph that are acyclic.

[Figure 4.10 panels: a) dependence graph with additional skip factors, showing register and scheduling edges (bold: edges in underlying cycles), nonzero additional skip factors, and cumulative additional skip factors along underlying cycles; b) the schedule (II = 6). Latencies: add (a0, a2, a4, a7, a9, a10, a12, a13, a14), load (l1, l11), and store (s8) operations take 1 cycle; mult (m15, m16) 4 cycles; div (d3, d5) 8 cycles; sqrt (r6) 10 cycles.]

Figure 4.10: Initial stage-schedule with additional skip factors (Example 4.5).


Theorem 4.5 (Acyclic Transform) Setting to zero the additional skip factor of a cut edge of the underlying dependence graph results in a valid stage schedule.

Proof. To guarantee that the acyclic transform results in valid stage schedules, we must show that each $p_{i,j}$ remains nonnegative and that the sum of the $p_{i,j}$ along each underlying cycle of the dependence graph remains unchanged. By definition, a cut edge cannot be part of any underlying cycle as, otherwise, the removal of that edge would not result in an unconnected underlying graph. As a result, this transform does not modify the value of the additional skip factors along any underlying cycle. Consequently, the sum of the $p_{i,j}$ along each underlying cycle remains constant and each $p_{i,j}$ remains nonnegative. □

b) Modified schedule u1

u1 j1

j2

j1

j2









i u2

k1

i

k2

u3







u2

k3

u4

k1

k2

u3







k3

u4

underlying cycle additional skip factor register and scheduling edge

Figure 4.11: Up-propagation transform.

The second transform shifts some of the additional skip factors associated with the output edges of a vertex to the input edges of that vertex. Figure 4.11 illustrates this transform, where operation i is scheduled three stages later, i.e. three units of additional skip factor are propagated up, from the output edges to the input edges of operation i.

Theorem 4.6 (Up-Propagation Transform) Reducing the additional skip factor of each output edge of a vertex by the minimum additional skip factor among these edges and increasing the additional skip factor of each input edge by the same amount results in a valid stage schedule.

Proof. By decreasing the $p_{i,j}$ of each output edge of vertex i by the minimum $p_{i,j}$ among these edges, the $p_{i,j}$ values are guaranteed to remain nonnegative. Since the $p_{i,j}$ of each input edge is increased by the same amount as the decrease of each output edge, these $p_{i,j}$ values remain nonnegative and the sum of the $p_{i,j}$ along each underlying cycle is guaranteed to remain constant. □

A similar transform, down-propagation, shifts the additional skip factor down instead of up, and is obtained by interchanging the words input and output in the description of the up-propagation transform.
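Both propagation transforms amount to a constant-time shift of skip factors around one vertex. A minimal sketch of the up-propagation transform, assuming the stage schedule is held as a dictionary of additional skip factors indexed by edge:

def up_propagate(p, in_edges, out_edges, v):
    """Shift skip factor from the output edges of vertex v to its inputs;
    in_edges/out_edges map a vertex to its (i, j) edge keys, with cut edges
    and self-loops assumed excluded by the caller. The down-propagation
    transform is the same function with in_edges and out_edges swapped."""
    if not out_edges[v]:
        return
    shift = min(p[e] for e in out_edges[v])   # largest legal shift
    if shift == 0:
        return
    for e in out_edges[v]:                    # v moves 'shift' stages later
        p[e] -= shift
    for e in in_edges[v]:
        p[e] += shift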

4.4.3 Acyclic Heuristic (AC)

The first heuristic reduces the register requirements associated with edges that belong to acyclic parts of the underlying dependence graph. This heuristic has a linear computational complexity in the number of edges and results in valid stage schedules with no larger register requirements.

First, the heuristic detects all cut edges, using a linear-time algorithm that enumerates the biconnected components of an undirected graph [1, pp. 179-187], i.e. the underlying cycles of the dependence graph. Cut edges are then simply the edges linking vertices from distinct biconnected components. Note that self-loop edges can safely be ignored during the stage scheduling process, because they do not introduce any scheduling constraints for the stage scheduler. (The scheduling constraints generated by self-loop edges simply yield a lower bound on II; since the stage scheduling heuristics do not change II, these constraints, and edges, can be safely ignored during stage scheduling.) The heuristic then performs the constant-time acyclic transform on each of the cut edges. Since acyclic transforms only decrease the $p_{i,j}$, and since the register requirements associated with a stage schedule are the sum of the skip factors of the last use-operations plus other unrelated terms, the register requirements cannot increase.


This heuristic highlights the advantage of describing a stage schedule using skip factors associated with edges instead of absolute scheduling times associated with operations.

In Figure 4.10a there are 2 cut edges with nonzero $p_{i,j}$. Setting $p_{16,7}$, the additional skip factor between operations mult16 and add7, to 0 effectively delays mult16 and all its predecessors by II cycles. This simple local change thus reschedules operations add10, load11, add12, add13, add14, mult15, and mult16 one stage later. Similarly, setting $p_{9,8}$ to 0 delays add9 by four stages. Because the additional skip factor of a cut edge can always be set to 0, without affecting any of the underlying cycles and regardless of the rest of the dependence graph, the AC heuristic will be used first, and subsequent heuristics will ignore both cut edges and self-loop edges. All of the following heuristics will thus operate only on the nontrivial cyclic parts of the underlying graph. In Figure 4.10a, there are two nontrivial biconnected components, one containing the operations load1, add2, div3, add4, and div5, and the other containing the operations load11, add12, add13, add14, mult15, and mult16. The edges of these components are shown in bold in Figure 4.10a.
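The cut edges the AC heuristic needs are exactly the bridges of the undirected underlying graph, which can be found with one depth-first search, in the spirit of the linear-time biconnected-components algorithm cited above. A sketch, assuming an adjacency-list representation with self-loop edges already removed:

def cut_edges(n, adj):
    """adj[v]: list of (neighbor, edge_id) pairs; parallel edges are
    distinguished by edge_id. Returns the set of edge_ids whose removal
    disconnects the underlying graph (the bridges)."""
    disc = [-1] * n                 # DFS discovery times
    low = [0] * n                   # lowest discovery time reachable
    bridges, time = set(), [0]

    def dfs(v, parent_edge):
        disc[v] = low[v] = time[0]
        time[0] += 1
        for w, eid in adj[v]:
            if eid == parent_edge:
                continue
            if disc[w] == -1:
                dfs(w, eid)
                low[v] = min(low[v], low[w])
                if low[w] > disc[v]:      # no back edge spans (v, w)
                    bridges.add(eid)
            else:
                low[v] = min(low[v], disc[w])

    for v in range(n):
        if disc[v] == -1:
            dfs(v, None)
    return bridges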

4.4.4 Sink/Source Heuristic (SS)

The second heuristic addresses single vertices that are scheduled too early or too late. It considers only sink vertices, i.e. vertices with no output edges, and source vertices, i.e. vertices with no input edges. This heuristic has a linear computational complexity in the number of edges and results in valid stage schedules with no larger register requirements.

First, the heuristic detects all sink and source vertices, ignoring cut edges and self-loop edges. Then, the heuristic applies the up-propagation transform to all source vertices, visiting each edge at most once. Since a source vertex has no input edges, the up-propagation transform only decreases the additional skip factors of its output edges. Thus, the register requirements cannot increase. Similarly, the heuristic applies down-propagation to all the sink vertices. In Figure 4.10a, load1 and load11 are source vertices and div5 and mult16 are sink vertices. However, as the minimum $p_{i,j}$ in each of the four cases is 0, the SS heuristic does not cause any change in this example.

4.4.5 Up Heuristic (UP)

We now present a technique that enhances the impact of the SS heuristic by propagating the additional skip factors upward in the dependence graph. This heuristic has a linear computational complexity in the number of edges, and results in valid stage schedules. Note that this heuristic may increase the register requirements of a schedule, although that rarely occurs in our benchmark. (In our benchmark suite of 1327 loops, the UP heuristic increased the register requirements of only three loops, and those by no more than 2.)

[Figure 4.12 panels: a) initial schedule and b) improved schedule for the biconnected component containing l11, a12, a13, a14, m15, and m16, showing the additional skip factors before and after up-propagation.]

Figure 4.12: UP propagation heuristic.

We illustrate this heuristic by investigating one of the biconnected components of our running example, shown in Figure 4.12a. As previously mentioned, the SS heuristic cannot reduce the additional skip factors, as both the minimum $p_{i,j}$ among the output edges of the source operation, load11, and the minimum $p_{i,j}$ among the input edges of the sink operation, mult16, are 0.

130

heuristic cannot reduce the additional skip factors as both the minimum pi;j among the output edges of the source operation, load11, and the minimum pi;j among the input edges of the sink operation, mult16, are 0. However, p12;15, p13;15, and p14;16 can each be reduced by 1 without modifying the sum of the pi;j along the two underlying cycles and without resulting in negative pi;j values. This heuristic proceeds by applying the up-propagation transform (ignoring selfloop edges and cut edges) to the vertices of the dependence graph, in reverse topological order. In Figure 4.12a, the transform can be applied to operations add12, add13, and add14, and then to load11. Figure 4.12b shows the additional skip factors after completion of this heuristic. For strongly connected components, it may be advantageous to use this heuristic several times. A similar heuristic may be constructed using the down-propagation transform with the normal topological order.

4.4.6 Register Sink/Source Heuristic (RSS) This last heuristic is based on removing redundant register edges in the dependence graph, which helps make improvements of a stage schedule more apparent. This heuristic has a quadratic computational complexity in the number of register edges, as shown in Section 3.5, and results in valid stage schedules with no larger register requirements. Figure 4.13 illustrates the importance of removing redundant register edges, using the second biconnected component of Figure 4.10a. Figure 4.13a illustrates the dependence graph for this component before the elimination of redundant register edges. In this graph, the cumulative additional skip factor of the leftmost underlying cycle must be equal to 1. For example, either p1;2 or p2;4 may be 1, with the other pi;j = 0. Without careful inspection, one may conclude that either solution results in the same register requirements, as reducing one lifetime increases the other, since both p1;2 and p2;4 are associated with scheduling and register edges. However, by using the redundant edge removal technique presented in Section 3.5, we discover that the register edges (load1, add2) and (load1, div3) are redundant, 131

a) Initial schedule

b) Improved schedule

l1

l1

a2

d3

d3 a2

a4

a4



d5

time

d5

scheduling edge register edge additional skip factor

Figure 4.13: Register source heuristic example.

It is then apparent that minimizing $p_{2,4}$, rather than $p_{1,2}$, results in lower register requirements, since $p_{1,2}$ is no longer associated with a register edge, as shown in Figure 4.13b.

First, the heuristic removes all redundant edges, which has a quadratic computational complexity in the number of edges if the MinDist relation is provided. Computing the MinDist relation [49] corresponds to an all-pairs longest-path problem, which has a cubic computational complexity in the number of operations (in each strongly connected component). However, this relation may be readily available, as some modulo schedulers use this relation to determine the minimum initiation interval. Then, the RSS heuristic applies the up-propagation transform to each reg-source vertex, i.e. each vertex without input register edges (ignoring self-loop edges and cut edges). Although the up-propagation transform may increase the additional skip factors along the input edges, we know that none of these edges contributes to the overall register requirements. Thus, the register requirements cannot increase. We then apply the down-propagation transform to the reg-sink vertices, i.e. those without output register edges.

[Figure 4.14 panels: a) dependence graph with improved additional skip factors, annotated with the heuristic (AC, UP, or RSS) responsible for each change and the cumulative stage shifts (+II up to +4*II); b) improved schedule (II = 6).]

Figure 4.14: Improved stage-schedule (Example 4.5).


4.4.7 Combining Heuristics

To summarize the benefits of the stage scheduling heuristics presented in Section 4.4, we consider the schedules of Example 4.5 presented in Figures 4.10b and 4.14b. The new stage schedule in Figure 4.14 delays one operation by 4 stages (add9), one operation by 3 stages (add14), four operations by 2 stages (add10, load11, add12, add13), and three operations by 1 stage (add2, mult15, mult16), reducing the register requirements by 9, i.e. a reduction of 4 for vr9, 2 for vr14, 1 each for vr2, vr12, vr13, and vr16, and an increase of 1 for vr11. Figure 4.14a also shows which stage scheduling heuristics produce each modification in the dependence graph.

Combining the stage scheduling heuristics is advantageous, as several of them address distinct parts of the dependence graph. For example, using the AC heuristic in conjunction with the other stage scheduling heuristics is very advantageous, as the AC heuristic focuses on the acyclic parts of the underlying dependence graph whereas the other heuristics focus on the cyclic parts. Using both the UP and the RSS heuristics is advantageous, as they address distinct features of underlying cycles. Also, in graphs with strongly connected components, it may be advantageous to use the UP heuristic several times, as the propagation of the $p_{i,j}$ values may require more than one pass before they are fully reduced in the presence of strongly connected components. Note also that since the SS heuristic is fully covered by the RSS heuristic, there is no gain in using the SS heuristic if the RSS heuristic is used.
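Under assumed names for the individual heuristics, the best-performing combination reported in Section 4.6 (AC+3*UP+RSS) is simply their sequential composition; the sketch below illustrates the ordering, with each heuristic assumed to mutate the additional skip factors of the graph in place.

def ac_3up_rss(graph):
    acyclic_heuristic(graph)                  # AC: zero p on cut edges first
    for _ in range(3):
        up_heuristic(graph)                   # UP: reverse-topological passes
    register_sink_source_heuristic(graph)     # RSS: after redundant-edge removal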

4.5 Schedule-Length Sensitive Stage Scheduling

Since the stage scheduling approach preserves the initiation interval, it reduces the register requirements without lowering the steady-state throughput of a modulo schedule. However, the overall performance of a modulo schedule is also a function of its transient performance, which is directly proportional to the schedule length (SL) of an iteration. And since the stage scheduling algorithms presented in Sections 4.3 and 4.4 modify the stages of the operations, these algorithms may increase SL and thus degrade the overall performance of the schedule.

[Figure 4.15 panels: schedule-length insensitive stage-scheduling (a: initial schedule, b: modified schedule) versus schedule-length sensitive stage-scheduling (c: initial schedule with start/stop pseudo operations, d: modified schedule), showing operations, pseudo operations, scheduling edges, register and scheduling edges, and nonzero additional skip factors.]

Figure 4.15: Stage scheduling and schedule length (Example 4.6).

In this section, we present a technique that alters the original dependence graph such that stage-scheduling algorithms cannot increase the schedule length of an iteration when searching for a schedule with reduced register requirements.

Example 4.6 We illustrate the effect of stage scheduling on the schedule length using the following example: z[i] = 3*x[i]+y[i] and w[i] = 1/y[i], computed as presented in Figure 4.15. A modulo schedule for this example is shown in Figure 4.15a, where each operation is scheduled as early as possible to minimize the schedule length. This schedule does not minimize the register requirements, as the value defined by the load4 operation is alive for three stages before being used by the add2 operation, i.e. $p_{4,2} = 3$. Since there is no underlying cycle in the dependence graph, a schedule with minimum register requirements is simply obtained by setting all the additional skip factors ($p_{i,j}$) to 0, as shown in Figure 4.15b. It is clear from this figure that while this transformation minimizes the register requirements, it may also increase the schedule length of the modulo schedule.

The main idea in achieving schedule-length sensitive stage-scheduling is to introduce additional scheduling constraints into the dependence graph such that the schedule length cannot be increased by the scheduling algorithm without violating some scheduling constraint. To express these additional constraints easily, we introduce a start and a stop vertex which are, respectively, a predecessor and a successor of every other vertex in the dependence graph. The additional scheduling constraint is then simply formulated as an additional scheduling edge from the stop vertex to the start vertex with the associated latency $l_{stop,start} = -(SL - 1)$, where SL is the desired schedule length. In general, the start and stop vertices must each be connected to every other vertex of the dependence graph; however, superfluous scheduling edges can be removed using Theorem 3.2. The latencies associated with these additional edges vary depending on the definition of the schedule length. We consider here the schedule length as the distance from the cycle in which the earliest operation is scheduled to the cycle in which the computation of the last value is completed (i.e. its schedule cycle plus its latency). Thus, for each operation i, the latency of edge (start, i) is set to 0 and the latency of edge (i, stop) is set to the latency of operation i. The modified dependence graph associated with Example 4.6 is shown in Figure 4.15c. Note that each operation is now part of an underlying cycle.

In Figure 4.15c, we may delay the store6 operation by one stage without increasing the schedule length, since $p_{6,stop} = 1$. This fact can be exploited by a stage-scheduler to reduce the lifetime along edge (4, 2) by one stage, i.e. by II cycles, as shown in Figure 4.15d. Note that this solution has the same SL as the initial schedule in Figure 4.15a, but that the modified schedule in Figure 4.15b achieves a greater reduction in register requirements by increasing SL.
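A minimal sketch of the graph transformation just described, assuming edges are kept as a dictionary from (source, destination) pairs to latencies; superfluous edges (removable via Theorem 3.2) are not pruned here.

def add_schedule_length_edges(ops, latency, edges, SL):
    """ops: operation ids; latency[i]: latency of op i; edges: dict mapping
    (src, dst) -> latency l_{i,j}, modified in place. Adds the start/stop
    pseudo-vertices of Section 4.5 so that no valid stage schedule can
    exceed schedule length SL."""
    START, STOP = "start", "stop"
    for i in ops:
        edges[(START, i)] = 0            # start precedes every operation
        edges[(i, STOP)] = latency[i]    # stop follows each op's completion
    edges[(STOP, START)] = -(SL - 1)     # back edge bounding schedule length
    return START, STOP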

4.6 Measurements

In this section we investigate the register requirements of the integer and floating-point register files for a benchmark of 1327 loops obtained from the Perfect Club [13], SPEC-89 [91], and the Livermore Fortran Kernels [65], compiled for the Cydra 5 as described in Section 2.7. We use here the Iterative Modulo Scheduler [78] to produce high quality MRT-schedules, which are used as input to both the optimal stage scheduling algorithms and the stage scheduling heuristics. Thus, in this section, every schedule of a loop shares the same MRT and differs only in the stage in which each operation is placed. The performance of the following stage scheduling algorithms is investigated here.

MinReg Stage Scheduler. This stage scheduler minimizes MaxLive, the maximum number of live values at any single time in the loop schedule, over all valid modulo schedules that share a given MRT. The resulting schedule has the lowest achievable register requirements for the given machine, loop, and MRT. For dependence graphs with underlying cycles, the complexity of this scheduler may require the solving of an exponential number of LP-problems.

MinBuff Stage Scheduler. This stage scheduler minimizes the integral MaxLive over all valid modulo schedules that share a given MRT. The resulting schedule has the lowest achievable buffer requirements for the given machine, loop, and MRT. We present here the register requirements associated with these schedules in our comparisons. For dependence graphs with underlying cycles, this scheduler requires the solving of only a single LP-problem.

Stage Scheduling Heuristics. The stage scheduling heuristics used in this section are all based on the four basic heuristics presented in Section 4.4, namely the Acyclic (AC), Sink/Source (SS), Up (UP), and Register Sink/Source (RSS) heuristics. When combined, the name of the stage scheduling heuristic simply lists the basic heuristics in the order in which they are used. For example, the stage scheduling heuristic AC+3*UP consists of one call to the Acyclic heuristic followed by three calls to the Up heuristic.

The register requirements of the schedules produced by the above algorithms are investigated for loops with unlimited schedule length in Section 4.6.1 and with bounded schedule length in Section 4.6.2. We then investigate the effect of increasing II on the register requirements in Section 4.6.3. To facilitate comparison with previous work, the MRT-schedules used in Sections 4.6.1 and 4.6.2 correspond to the schedules computed by Rau and used in [78]. In Section 4.6.3, however, we use our implementation of the Iterative Modulo Scheduler to compute the required MRT-schedules, since we also need MRT-schedules with larger initiation intervals.

4.6.1 Schedules with Unlimited Schedule Length

We first investigate the register requirements of stage schedules with unlimited schedule length. The characteristics of the input data dependence graphs and MRT-schedules for the 1327-loop benchmark are shown in Table 4.2.

Measurement                        min     freq   median  average      max
II                                1.00    28.6%     3.00    11.51   165.00
II/MII                            1.00    96.0%     1.00     1.01     1.50
operations                        2.00     0.4%    10.00    17.54   161.00
operations in underlying cycles   0.00    42.4%     4.00    10.18   158.00
underlying cycles                 0.00    55.4%     0.00     3.01    66.00
register edges                    1.00     0.4%    12.00    22.30   220.00
  without redundant edges         1.00     0.4%    11.00    21.35   220.00
scheduling edges                  1.00     0.4%    14.00    25.79   345.00
  without redundant edges         1.00     0.4%    12.00    22.47   232.00

Table 4.2: Characteristics of the 1327-loop benchmark suite for schedules with unlimited schedule length.

The data most relevant to the computational complexity of the optimal stage scheduling algorithm are the number of operations that belong to some nontrivial underlying cycle, the number of register edges, and the initiation interval. Table 4.2 indicates that, in this regard, the 1327-loop benchmark contains a wide range of loops, including loops with up to 158 operations in underlying cycles, 220 register edges, and an II of up to 165 cycles.

The high quality of the MRT-schedules is indicated by the fact that 96.0% of the loops scheduled by the Iterative Modulo Scheduler [78] achieve MII. Notice the fraction of operations in underlying cycles: if the workload draws an equal number of iterations from each loop in the benchmark, 57.9% of the operations belong to some underlying cycle; if the workload draws an equal number of nodes from each loop, 37.4% of the operations belong to an underlying cycle. The difference between these two percentages occurs because smaller loops have a lower percentage of their operations in underlying cycles. If operations with self-loops are also considered as underlying cycles, these percentages become 76.3 and 62.5%, respectively. Table 4.2 also indicates the effectiveness of the dependence graph transformation presented in Section 3.5: when redundant register and scheduling edges are eliminated from the dependence graphs, the average numbers of edges decrease by 4.3 and 12.9%, respectively.

When searching for optimal stage schedules with unlimited schedule length, the MinBuff Stage-Scheduler finds a schedule with minimum integral MaxLive for all 1327 loops. Similarly, the MinReg Stage-Scheduler finds a schedule with minimum MaxLive for 1296 of the 1327 loops; for the remaining 31 loops, the scheduler exceeded an upper limit of 512 augmented problems solved in Algorithm 4.2, Step 4. Our measurements include these loops as well, for which we report the schedules with the lowest register requirements among all stage-schedules considered.

Our first measurement investigates the register requirements achieved by the MinReg Stage-Scheduler, which results in the lowest register requirements over all valid modulo schedules that share a common MRT, and compares the results to a register-insensitive scheduling algorithm and a schedule-independent lower bound. This measurement provides insight regarding the range of decrease in register requirements that can be achieved by the stage scheduling approach.

For a machine with any given number of registers, Figure 4.16 presents the fraction of the loops in the benchmark suite that can be scheduled without spilling and without increasing II. In this graph, the X-axis represents MaxLive and the Y-axis represents the fraction of loops scheduled.

[Figure 4.16: fraction of loops scheduled (Y-axis, 0 to 1) versus MaxLive (X-axis, 0 to 160); curves: Schedule-Independent Lower Bound, MinReg Stage-Scheduler, and Iterative Modulo Scheduler.]

Figure 4.16: Register requirements in the 1327-loop benchmark suite.

Thus, for example, if 32 registers were available, the MinReg Stage-Scheduler could schedule 58% of the loops without spilling and without increasing II. Alternatively, to schedule 80% of the loops successfully with the chosen II and no spilling, the MinReg Stage-Scheduler needs at least 45 registers. The second curve in Figure 4.16, labeled "Schedule-Independent Lower Bound," corresponds to MinAvg in [49] and is a lower bound on the register requirements formed by assuming a machine with unlimited issue width and unlimited functional resources. The lower bound is obtained by computing the MinDist relation and by assuming that no register lifetime spans more than the minimum scheduling distance (i.e. MinDist) between its def- and its use-operations.

Notice the gap between the MinReg Stage-Scheduler curve and the Lower Bound curve, which we believe is caused by two factors. First, the MinReg Stage-Scheduler searches for a schedule with the lowest register requirements only among the modulo schedules that share a given MRT. Therefore, when the MRT given to the stage scheduler is suboptimal, the stage scheduler will find the fewest registers required to support that MRT, which may be significantly larger than the absolute minimum register requirements over all MRTs for that loop. (Experimental evidence on a subset of this benchmark suite, consisting of 920 loops with up to 41 operations and II up to 118, for which an optimal modulo schedule was sought over all MRTs (see Chapter 5), showed, however, that the difference between the local and absolute minimum was surprisingly small. This result may not hold for larger loops.)

[Figure 4.17: fraction of loops scheduled (Y-axis, 0 to 1) versus additional MaxLive (X-axis, 0 to 52); curves: Best Heuristic, MinBuff Stage-Scheduler, Simple Heuristic, and Iterative Modulo Scheduler.]

Figure 4.17: Additional register requirements relative to the MinReg Stage-Scheduler.

The second factor is that the schedule-independent lower bound may be significantly too optimistic, due to the complexity of some of the loops in the benchmark suite and the fact that this lower bound assumes unlimited issue width and unlimited functional resources. The third curve presents the register requirements of the Iterative Modulo Scheduler, which presently does not attempt to minimize the register requirements. The gap between these last two curves indicates the degree of improvement that stage scheduling heuristics can potentially achieve, i.e. they should do better than the Iterative Modulo Scheduler and cannot do better than the MinReg Stage-Scheduler.

Our second measurement illustrates the additional registers needed when scheduling algorithms other than the MinReg Stage-Scheduler are used. This measurement allows us to precisely evaluate the performance, in register requirements, of the various stage scheduling heuristics investigated in this chapter. From top to bottom, the curves in Figure 4.17 present the additional registers needed by the Best Heuristic (AC+3*UP+RSS: the AC heuristic, followed by three passes of the UP heuristic, followed by the RSS heuristic; additional UP passes only marginally decrease the register requirements), the MinBuff Stage-Scheduler, a Simple Heuristic (AC+SS: the AC heuristic followed by the SS heuristic), and the Iterative Modulo Scheduler.

The first interesting observation is that the Best Heuristic has lower additional register requirements than the MinBuff Stage-Scheduler. As indicated in Figure 4.17, the Best Heuristic finds a schedule with no additional register requirements in 95.9% of the loops, versus 92.5% for the MinBuff Stage-Scheduler; also, the Best Heuristic schedules 98% of the loops with 1 additional register, versus 3 for the MinBuff Stage-Scheduler. This observation can be explained by the fact that, since the MinBuff Stage-Scheduler only minimizes integral MaxLive, it arbitrarily picks one of the stage schedules with minimum integral MaxLive, disregarding fractional MaxLive. As indicated by the measurements in Figure 4.17, this arbitrary pick results in schedules with higher register requirements than the schedules found by the Best Heuristic.

The second observation is that even a simple heuristic significantly decreases the register requirements, below those of a register-insensitive scheduler. In Figure 4.17, the Simple Heuristic finds a stage schedule with no additional register requirements in 80.0% of the loops, and schedules 95% of the loops with no more than 9 additional registers. The Iterative Modulo Scheduler results in a stage schedule with no additional register requirements in 42.0% of the loops, a surprisingly high number for a scheduler that does not even attempt to minimize the register requirements. This result can be explained in part by the fact that this scheduler attempts to minimize the length of a schedule, which generally results in a stage schedule with low register requirements along the critical path of that schedule. However, the Iterative Modulo Scheduler needs up to 23 additional registers to successfully schedule 95% of the loops, indicating that more than 5% of the loops have a significant number of operations that are not on the critical path and are scheduled too early or too late.

[Figure 4.18: Additional register requirements of stage scheduling heuristics. Fraction of loops scheduled (Y-axis) versus additional MaxLive (X-axis, 0 to 12) for the AC+3*UP+RSS, AC+3*UP, AC+UP, AC+RSS, AC+SS, and AC heuristics, and the Iterative Modulo Scheduler.]

Our third measurement investigates in more detail the additional register requirements of several stage scheduling heuristics. To facilitate the reading of Figure 4.18, the heuristics are listed in the legend in order of decreasing performance. The bottom curve corresponds to the performance of the Iterative Modulo Scheduler. The next curve corresponds to the AC heuristic, which finds a stage schedule with no additional register requirements in 73.9% of the loops, and schedules 95% of the loops with no more than 10 additional registers. This significant decrease in register requirements due to the AC heuristic is solely obtained by moving the unmodified schedules of the biconnected components of the underlying dependence graph to be as close to one another as possible. To decrease the register requirements of the modulo scheduled loops further, the schedule within each biconnected component must also be considered.

[Figure 4.19: Schedule length of one loop iteration in the 1327-loop benchmark suite. Fraction of loops scheduled (Y-axis) versus schedule length (X-axis, 0 to 80) for the Schedule-Independent Lower Bound, the Iterative Modulo Scheduler, the AC+3*UP+RSS heuristic, and the MinReg Stage-Scheduler.]

The next curve corresponds to the AC+SS heuristic, which also reschedules those vertices in the dependence graph that have no input edges or no output edges (ignoring self-loop and cut edges). The AC+SS heuristic results in lower register requirements than the AC heuristic in the range from 1 to 10 additional registers and is nearly indistinguishable after that. The next curve corresponds to the AC+RSS heuristic, which performs better than the two previous heuristics in the range from 1 to 10 additional registers. This heuristic requires the removal of redundant register edges, which is accomplished in quadratic time, provided that the MinDist relation is available in a lookup table. We notice that the AC, AC+SS, and AC+RSS heuristics each schedule approximately 95% of the loops with no more than 10 additional registers. To improve the register requirements further, we must more aggressively address the biconnected components using heuristics based on the up-propagation transform.

The last three mostly overlapped curves of Figure 4.18 correspond to the performance of the AC+UP, AC+3*UP, and AC+3*UP+RSS heuristics.

These three heuristics perform significantly better, finding an optimal stage schedule for 92.9%, 95.0%, and 95.9% of the loops and scheduling 98% of the loops with no more than 3, 1, and 1 additional registers, respectively. Although these three curves mostly overlap, the AC+3*UP+RSS heuristic does perform slightly better, with additional register requirements of 94 registers, instead of 110 (AC+3*UP) and 241 (AC+UP), summed over all 1327 loops of the benchmark suite.

Our next experiment investigates the impact of the stage scheduling algorithms on the schedule length (SL), which determines the transient performance of the modulo schedules. The distribution of SL is shown in Figure 4.19 for the Iterative Modulo Scheduler, the MinReg Stage-Scheduler, and the AC+3*UP+RSS heuristic. The graph also includes a lower bound on the schedule length of a schedule, labeled "Schedule-Independent Lower Bound," that corresponds to the MinDist relation between the start and stop vertices, as defined in Section 4.5.

Comparing the schedule length of the MinReg Stage-Scheduler to the Iterative Modulo Scheduler, the stage scheduler reduces SL by 1 cycle for 2 out of 1327 loops, i.e. for 0.2% of the loops. It results in the same SL in 63.7% of the loops, increases SL by 1 cycle in 12.0%, by 2 in 5.8%, by 3 to 10 cycles in 10.2%, and by up to 155 cycles in the remaining 9.1% of the loops, resulting in a median of 0 and an average of 2.72 additional cycles. The average increase of 2.06 additional cycles in schedule length for the AC+3*UP+RSS heuristic is slightly smaller.

A better metric for the increase in schedule length is the ratio of SL to the schedule length lower bound. This ratio is given in Table 4.3 for the Iterative Modulo Scheduler and each stage scheduling algorithm. Compared to the Iterative Modulo Scheduler, the MinReg Stage-Scheduler increases the average schedule length ratio by 5.6%, while decreasing the average register requirements by 19.8%. Similarly, the stage scheduling heuristics increase the average schedule length ratio by up to 4.6%, while decreasing the average register requirements by up to 19.6%.

Ultimately, the best performance metric is the total execution time associated with a schedule, which is a function of II, SL, and profiling information (i.e. the total number of times the loop is entered and executed).

Measurements:                      min     freq     median   average   max
Iterative Modulo Scheduler:
  MaxLive                          2.00    0.1%     38.00    34.36     166.00
  Schedule Length (ratio)          1.00    45.3%    1.03     1.08      2.47
  Execution Time (ratio)           1.00    52.9%    1.00     1.03      1.48
MinReg Stage-Scheduler:
  MaxLive                          2.00    0.1%     29.00    27.54     133.00
  Schedule Length (ratio)          1.00    38.1%    1.05     1.15      2.96
  Execution Time (ratio)           1.00    48.1%    1.00     1.06      1.69
AC+3*UP+RSS Heuristic:
  MaxLive                          2.00    0.1%     29.00    27.61     133.00
  Schedule Length (ratio)          1.00    38.4%    1.04     1.13      2.96
  Execution Time (ratio)           1.00    48.2%    1.00     1.06      1.69
AC+3*UP Heuristic:
  MaxLive                          2.00    0.1%     29.00    27.62     133.00
  Schedule Length (ratio)          1.00    38.4%    1.04     1.13      2.96
  Execution Time (ratio)           1.00    48.2%    1.00     1.06      1.68
AC+UP Heuristic:
  MaxLive                          2.00    0.1%     29.00    27.72     133.00
  Schedule Length (ratio)          1.00    38.5%    1.04     1.13      2.96
  Execution Time (ratio)           1.00    48.2%    1.00     1.06      1.68
AC+RSS Heuristic:
  MaxLive                          2.00    0.1%     29.00    28.33     135.00
  Schedule Length (ratio)          1.00    38.6%    1.04     1.13      2.96
  Execution Time (ratio)           1.00    48.2%    1.00     1.06      1.68
AC+SS Heuristic:
  MaxLive                          2.00    0.1%     29.00    28.73     135.00
  Schedule Length (ratio)          1.00    38.6%    1.04     1.13      2.96
  Execution Time (ratio)           1.00    48.2%    1.00     1.06      1.68
AC Heuristic:
  MaxLive                          2.00    0.1%     29.00    29.15     135.00
  Schedule Length (ratio)          1.00    38.9%    1.04     1.12      2.96
  Execution Time (ratio)           1.00    48.2%    1.00     1.06      1.68

Table 4.3: Characteristics of the stage scheduling algorithms with unlimited schedule length.

We investigate here the ratio of the execution time to the execution time lower bound for the 597 loops for which the profiling data was available to us. The execution times are computed using Equation (3.1), using II and SL to compute the actual execution time of the schedules, and MII and the schedule length lower bound to compute the execution time lower bound. These ratios, shown in Table 4.3, indicate that using the stage scheduling algorithms (with unlimited schedule length) instead of the Iterative Modulo Scheduler results in an increase in the average execution time ratio of 2.9%.

Algorithms:           AC+SS     AC+3*UP+RSS   MinBuff                 MinReg
Redundant edges:                              yes         no          yes          no
  average             0.1 ms    0.4 ms        141.8 ms    112.2 ms    43.9 sec     23.5 sec
  maximum             3.3 ms    12.7 ms       18691.7 ms  17558.4 ms  7180.0 sec   6495.5 sec
  std dev.            0.1 ms    0.6 ms        950.9 ms    816.6 ms    374.0 sec    268.4 sec

Table 4.4: Run time for the stage-scheduling algorithms on a Sun Sparc-20 workstation.

Runtime statistics gathered by executing several stage scheduling algorithms on a Sun Sparc-20 workstation are given in Table 4.4. Notice the decrease in runtime when redundant edges are eliminated from the dependence graph. For example, the MinReg Stage-Scheduler completes in 53.5% of the time, on average, when redundant edges are eliminated using the technique described in Section 3.5. Note also that the average run time of both heuristics (AC+SS and AC+3*UP+RSS) is 2 to 3 orders of magnitude faster than that of the MinBuff Stage-Scheduler and 4 to 5 orders of magnitude faster than that of the MinReg Stage-Scheduler.

4.6.2 Schedules with Bounded Schedule Length

We now investigate the register requirements of stage schedules when the schedule length of a loop iteration is bounded by the schedule length initially obtained by the Iterative Modulo Scheduler. The results presented here are thus indicative of the decrease in register requirements achieved while maintaining, or possibly increasing, the performance of the resulting schedules, since II is kept constant and SL cannot increase.

Measurements:                        min     freq     median   average   max
II                                   1.00    28.6%    3.00     11.51     165.00
II/MII                               1.00    96.0%    1.00     1.01      1.50
operations                           4.00    0.4%     12.00    19.54     163.00
operations in underlying cycles      4.00    0.4%     12.00    19.54     163.00
underlying cycles                    1.00    3.8%     5.00     11.10     116.00
register edges                       1.00    0.4%     12.00    22.30     220.00
  without redundant edges            1.00    0.4%     11.00    21.30     220.00
scheduling edges                     4.00    0.4%     33.00    59.88     601.00
  without redundant edges            4.00    0.4%     19.00    32.83     287.00

Table 4.5: Characteristics of the 1327-loop benchmark for schedules with bounded schedule length.

The characteristics of the input data dependence graphs (augmented by the start and stop vertices) and MRT-schedules for the 1327-loop benchmark are shown in Table 4.5. The major difference with respect to the characteristics of schedules with unlimited schedule length, shown in Table 4.2, is that each operation is now part of some underlying cycle. Other differences are the increased number of underlying cycles, 11.10 versus 3.01 cycles, and the increased number of scheduling edges, 32.83 versus 22.47 nonredundant edges, on average. Again, eliminating the redundant register and scheduling edges from the dependence graph is effective, decreasing the average number of edges by 4.5 and 45.1%, respectively.

In this section, the MinBuff Stage-Scheduler finds a schedule with minimum integral MaxLive for all 1327 loops, and the MinReg Stage-Scheduler finds a schedule with minimum MaxLive for 1246 of the 1327 loops. For the remaining 88 loops, we report in our measurements the schedules with the lowest register requirements among all stage-schedules considered. It is not surprising that the MinReg Stage-Scheduler finds fewer optimal schedules when the schedule length is bounded, since the solver is very sensitive to the number of underlying cycles.

Note also that since each vertex is part of an underlying cycle, the AC heuristic is not used here, as it impacts only the acyclic parts of the dependence graph. Similarly, the SS heuristic is not used here since it does not apply to vertices in a strongly connected component, where each vertex has at least one input and one output edge.

The first measurement investigates the increase in register requirements that is incurred when algorithms with bounded schedule length are used instead of the optimal stage scheduling algorithm with unlimited schedule length. Figure 4.20 illustrates the additional register requirements of the bounded MinReg Stage-Scheduler and the bounded stage scheduling heuristics, relative to the MinReg Stage-Scheduler with unlimited schedule length.

The top curve of Figure 4.20 corresponds to the bounded MinReg Stage-Scheduler. This curve is interesting since it compares the optimal scheduling algorithm with bounded SL to the same algorithm using unlimited SL. This top curve is thus indicative of the additional register requirements that are strictly due to the bound on SL. As shown in the graph, the bounded optimal stage scheduler finds a schedule with the same number of registers for 81.2% of the loops and requires up to 1 additional register to schedule 95% of the loops.

The other curves in Figure 4.20 correspond to the stage scheduling heuristics with bounded schedule length, and the Iterative Modulo Scheduler, which produced that schedule length, shown for comparison. These heuristics still result in a significant decrease in register requirements below Iterative Modulo Scheduling. For example, 95% of the loops can be scheduled with 2, 3, and 7 additional registers for the bounded 3*UP+RSS, 3*UP, and RSS scheduling heuristics, respectively.

Our next measurement evaluates more directly the effectiveness of the bounded heuristics, by comparing their register requirements to the bounded MinReg Stage-Scheduler. This comparison is provided in Figure 4.21. Our first observation is that the bounded heuristics achieve a significant fraction of the decrease in register requirements that is achieved by the bounded optimal stage scheduler, although they are not as close, on the same scale, as the unbounded heuristics are to the unbounded optimal stage scheduler.

[Figure 4.20: Additional register requirements with respect to the MinReg Stage-Scheduler with unlimited schedule length. Fraction of loops scheduled (Y-axis) versus additional MaxLive (X-axis, 0 to 12) for the bounded MinReg Stage-Scheduler, the bounded 3*UP+RSS, 3*UP, UP, and RSS heuristics, and the Iterative Modulo Scheduler.]

[Figure 4.21: Additional register requirements with respect to the MinReg Stage-Scheduler with bounded schedule length. Fraction of loops scheduled (Y-axis) versus additional MaxLive (X-axis, 0 to 12) for the bounded 3*UP+RSS, 3*UP, UP, and RSS heuristics, and the Iterative Modulo Scheduler.]

Measurements:                      min     freq     median   average   max
Iterative Modulo Scheduler:
  MaxLive                          2.00    0.1%     38.00    34.36     166.00
  Schedule Length (ratio)          1.00    45.3%    1.03     1.08      2.47
  Execution Time (ratio)           1.00    52.9%    1.00     1.03      1.48
Bounded MinReg Stage-Scheduler:
  MaxLive                          2.00    0.1%     29.00    27.82     133.00
  Schedule Length (ratio)          1.00    45.3%    1.03     1.08      2.47
  Execution Time (ratio)           1.00    52.9%    1.00     1.03      1.48
Bounded 3*UP+RSS Heuristic:
  MaxLive                          2.00    0.1%     38.00    28.03     166.00
  Schedule Length (ratio)          1.00    45.3%    1.03     1.08      2.47
  Execution Time (ratio)           1.00    52.9%    1.00     1.03      1.48
Bounded 3*UP Heuristic:
  MaxLive                          2.00    0.1%     29.00    28.24     148.00
  Schedule Length (ratio)          1.00    45.3%    1.03     1.08      2.47
  Execution Time (ratio)           1.00    52.9%    1.00     1.03      1.48
Bounded UP Heuristic:
  MaxLive                          2.00    0.1%     29.00    28.32     148.00
  Schedule Length (ratio)          1.00    45.3%    1.03     1.08      2.47
  Execution Time (ratio)           1.00    52.9%    1.00     1.03      1.48
Bounded RSS Heuristic:
  MaxLive                          2.00    0.1%     29.00    28.90     166.00
  Schedule Length (ratio)          1.00    45.3%    1.03     1.08      2.47
  Execution Time (ratio)           1.00    52.9%    1.00     1.03      1.48

Table 4.6: Characteristics of the stage scheduling algorithms with bounded schedule length.

For example, the best bounded heuristic finds a schedule with no additional registers for 91.6% of the loops, versus 95.9% for the best unbounded heuristic. This effect can be explained, in part, by the fact that two of the heuristics that are guaranteed not to increase the register requirements, i.e. AC and SS, do not apply to dependence graphs that are strongly connected.

The second observation is that the RSS heuristic is much more effective for bounded heuristics than it is for the heuristics with unlimited SL. For example, although 91.6% of the loops are scheduled without additional registers with the 3*UP+RSS heuristic, only 87.7% are with the 3*UP heuristic alone. This difference contrasts with the small improvement, 95.9 versus 95.0%, when AC+3*UP+RSS is used instead of AC+3*UP for the heuristics with unlimited SL.

The characteristics of the stage scheduling heuristics with bounded schedule length are summarized in Table 4.6. Notice that the schedule length and the execution time ratios are constant for every algorithm. As a result, the bounded MinReg Stage-Scheduler and the 3*UP+RSS heuristic achieve a decrease in register requirements of 19.0 and 18.4%, respectively, with no decrease in performance.

4.6.3 Schedules with Increased Initiation Interval

Our next experiments investigate the reduction in register requirements that is achieved when increasing II. We use here our implementation of the Iterative Modulo Scheduler in order to obtain MRT-schedules with increased II.

We first present in Figures 4.22 and 4.23 the register requirements of the Iterative Modulo Scheduler and the MinReg Stage-Scheduler, respectively, when the initial value of II found by the Iterative Modulo Scheduler is increased by 0, 1, 2, 3, 4, 5, 10, and 21 additional cycles. For a fixed number of registers, i.e. a vertical line in Figures 4.22 and 4.23, we may investigate the fraction of the loops that can be scheduled when increasing II by some amount. In Figure 4.23 for example, 58% of the loops can be scheduled by the MinReg Stage-Scheduler without increasing II on a machine with 32 registers. This percentage increases to 73% if II is increased by no more than 1 cycle, to 78% with no more than 2 cycles, to 88% with no more than 5 cycles, and up to 93% with no more than 21 additional cycles.

[Figure 4.22: Increasing II to reduce the register requirements using the Iterative Modulo Scheduler. Fraction of loops scheduled (Y-axis) versus MaxLive (X-axis, 0 to 64), with one curve for each II increase of 0, 1, 2, 3, 4, 5, 10, and 21 cycles.]

[Figure 4.23: Increasing II to reduce the register requirements using the MinReg Stage-Scheduler. Fraction of loops scheduled (Y-axis) versus MaxLive (X-axis, 0 to 64), with one curve for each II increase of 0, 1, 2, 3, 4, 5, 10, and 21 cycles.]

We may also interpret these graphs by investigating the increase in II that may be required to schedule a fixed fraction of the loops for varying numbers of available registers. In Figure 4.23 for example, II must be increased by up to 21 cycles to schedule 75% of the loops if only 16 registers are available. However, an II increase of up to 5 cycles is sufficient with 24 registers, of up to 2 cycles with 32 registers, and no II increase is needed with 42 registers. Note also that increasing II decreases the register requirements most significantly for schedules that initially require between 16 and 32 registers when the MinReg Stage-Scheduler is used, and between 16 and 48 registers when the Iterative Modulo Scheduler is used.

Our next experiment investigates the average initiation interval that is achieved on machines with finite register files, assuming that no spill code is used. When the register requirements of a schedule exceed the size of the register file, a new MRT-schedule and stage-schedule is sought with a larger II, until its register requirements fit the size of the register file or more than 21 attempts have been made, i.e. the search for a schedule was abandoned if II was increased by 21 cycles without resulting in a schedule that fits within the register file size. The resulting initiation intervals represent an upper bound on the performance degradation due to a finite register file since, in practice, spill code would also be introduced to reduce the register pressure.

Since the 1327-loop benchmark suite contains a wide range of loops, we divided the benchmark into 4 disjoint subsets. The first subset, Benchmark-16, corresponds to loops with low register requirements and includes the 992 loops that can be scheduled by the Iterative Modulo Scheduler with no more than 16 registers when increasing II by up to 21 cycles. The second subset, Benchmark-32, corresponds to loops with medium register requirements and includes the 218 loops not in Benchmark-16 that can be scheduled with no more than 32 registers and 21 additional cycles. Benchmark-48 and Benchmark-64 are defined similarly, resulting in 87 and 23 loops with no more than 48 and 64 registers, respectively. The 7 remaining loops of the 1327-loop benchmark require up to 114 registers (when II is increased by 21 cycles) and are not considered here.

Register limit            IMS       BH        UH        MR
Benchmark-16 (992 loops):
  reg <= 16               13.07     12.14     12.12     12.11
  reg <= 32               9.70      9.34      9.34      9.34
  reg <= 48               9.20      9.01      9.01      9.01
  reg <= 64               8.94      8.92      8.92      8.92
  unlimited reg           8.92      8.92      8.92      8.92
Benchmark-32 (218 loops):
  reg <= 16               n/a       n/a       n/a       n/a
  reg <= 32               22.63     19.53*    19.14*    19.11*
  reg <= 48               16.85     16.06     16.04     16.03
  reg <= 64               15.81     15.65     15.65     15.65
  unlimited reg           15.56     15.56     15.56     15.56
Benchmark-48 (87 loops):
  reg <= 48               34.03     28.63*    28.16*    28.07*
  reg <= 64               26.87     25.33     25.28     25.26
  unlimited reg           24.93     24.93     24.93     24.93
Benchmark-64 (23 loops):
  reg <= 64               36.13     31.43*    31.04*    31.04*
  unlimited reg           27.70     27.70     27.70     27.70

Table 4.7: Average II for machines with finite register files (where IMS is the Iterative Modulo Scheduler, BH is the bounded AC+3*UP+RSS heuristic, UH is the unbounded AC+3*UP+RSS heuristic, and MR is the unbounded MinReg Stage-Scheduler).

Table 4.7 presents the average II for the Iterative Modulo Scheduler (IMS), the bounded AC+3*UP+RSS heuristic (BH, for Bounded Heuristic), the AC+3*UP+RSS heuristic with unlimited schedule length (UH, for Unbounded Heuristic), and the MinReg Stage-Scheduler (MR) within the four disjoint benchmark subsets considered, showing how the achieved average II decreases in each subset as the register limit increases. Entries marked with an asterisk correspond to cases where a stage scheduler decreases the average II by more than 10%, compared to the register-insensitive algorithm. We may conclude from Table 4.7 that register-sensitive scheduling algorithms have a significant impact on the initiation interval when no spill code is introduced and when the register requirements of a loop are critical.

reg  16 reg  32 reg  48 reg  64 reg < 1 Algo:

Benchmark-16 474 of 992 loops 1:90 1:73 1:73 1:73 1:22 1:14 1:14 1:14 1:13 1:07 1:08 1:78 1:04 1:03 1:04 1:04 1:03 1:03 1:03 1:03 IMS BH UH MR

Benchmark-32 70 of 218 loops | 1:72 1:56 1:57 1:57 1:22 1:17 1:18 1:18 1:09 1:07 1:09 1:09 1:05 1:05 1:05 1:05 IMS BH UH MR

reg  16 reg  32 reg  48 reg  64 reg < 1 Algo:

Benchmark-48 Benchmark-64 40 of 87 loops 8 of 23 loops | | | | 1:63 1:32 1:32 1:33 | 1:19 1:06 1:07 1:08 1:46 1:29 1:33 1:33 1:04 1:04 1:04 1:04 1:10 1:10 1:10 1:10 IMS BH UH MR IMS BH UH MR

Table 4.8: Average execution time ratio for machines with nite register les (where IMS is the Iterative Modulo Scheduler, BH is the bounded AC+3*UP+RSS heuristic, UH is the unbounded AC+3*UP+RSS heuristic, and MR is the unbounded MinReg Stage Scheduler). Table 4.8 presents the execution time ratios achieved by the algorithms for the 597 loops for which pro ling information was available. Interestingly, the bounded stage scheduling heuristic performs generally better (and never worst) than the unbounded 156

[Figure 4.24: Impact of increasing II on the schedule length. Average schedule length (Y-axis, 34 to 52) versus additional II (X-axis, 0 to 20) for the MinReg Stage-Scheduler, the bounded 3*UP+RSS heuristic, the Iterative Modulo Scheduler, the AC+3*UP+RSS heuristic, and the Schedule-Independent Lower Bound.]

Incidentally, increasing the initiation interval does moderately affect the schedule length of the modulo schedules, as indicated by Figure 4.24. Note the consistent behavior of the unbounded scheduling algorithms, i.e. the average schedule length first decreases, reaches a minimum with an additional II of 1 or 2 cycles, and then increases. This effect appears to be an artifact of the unbounded scheduling algorithms, as the schedule-independent lower bound on the schedule length, which corresponds to the minimum distance between the start and stop pseudo-operations introduced in Section 4.5, does not indicate a similar effect, and the Iterative Modulo Scheduler and the bounded heuristic tend to keep reducing SL until they level off.

4.7 Summary

Modulo scheduling is an efficient technique for exploiting instruction level parallelism, resulting in high performance code, but increased register requirements. In this chapter, we have investigated optimal algorithms and heuristics for stage scheduling, which shifts the operations of a schedule by integer multiples of II cycles.

The first contribution of this chapter is an optimal stage scheduling algorithm, the MinReg Stage-Scheduler, which finds a schedule with minimum MaxLive over all valid modulo schedules that share a given MRT. This algorithm proceeds in linear time (in the number of edges) for loops whose dependence graphs have an acyclic underlying graph. Otherwise, the algorithm repetitively uses a linear-programming formulation to handle general dependence graphs with unrestricted loop-carried dependences and common subexpressions. The MinReg Stage-Scheduler uses a novel linear-programming formulation which focuses solely on those operations that are in an underlying cycle of the dependence graph. Because of this efficient representation, and by exploiting the special properties of the stage scheduling problem, a schedule with minimum MaxLive is found for 1296 of the 1327 loops found in a benchmark suite from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels as compiled for the Cydra 5. When applying this algorithm to the MRT-schedules produced by the register-insensitive Iterative Modulo Scheduler employed in our measurements, the average register requirements decrease by 19.8%.

This chapter also contributes efficient stage scheduling heuristics that reduce the register requirements of a modulo schedule with much lower computational complexity. One of the proposed heuristics, referred to as the Acyclic heuristic (AC), produces schedules in linear time that achieve the minimum MaxLive for dependence graphs that have acyclic underlying graphs. Superfluous register requirements in the cyclic parts of the underlying graphs are specifically addressed by the Register Sink/Source (RSS) and Up (UP) heuristics. By combining these basic heuristics, the AC+3*UP+RSS heuristic on average achieves 99.0% of the decrease in register

requirements obtained by the MinReg Stage-Scheduler in the 1327-loop benchmark suite, with a worst-case complexity that is quadratic in the number of edges, provided that the MinDist relation is given as input. The AC+3*UP heuristic on average achieves 98.8% of the optimal decrease in register requirements with a complexity that is linear in the number of edges.

Because the initiation interval of the given modulo schedules is preserved by the stage scheduling approach, the steady-state performance of a schedule is unaffected by the stage schedulers. However, the stage scheduling algorithms may have an impact on the schedule length of a schedule, and thus alter the transient performance of its schedules. Experimental evidence gathered from the 1327-loop benchmark indicates that the average schedule length ratio increases by 5.6% for the MinReg Stage-Scheduler, and by up to 4.6% for the stage scheduling heuristics. This increase in schedule length results, in turn, in an average execution time ratio that is 2.9% higher, for each of the stage scheduling algorithms, over the 597 loops for which profiling data was available to us.

To prevent an increase in execution time, we also investigated the performance of the stage scheduling algorithms that search only for schedules with no larger schedule length than the schedule given as input. Experimental evidence from the 1327-loop benchmark shows that the stage schedulers with this bounded schedule length still achieve a large fraction of the possible decrease in register requirements. For example, the MinReg Stage-Scheduler with bounded schedule length achieves 95.9% of the decrease in register requirements that is achieved by the MinReg Stage-Scheduler with unlimited schedule length. The bounded 3*UP+RSS stage scheduling heuristic achieves 92.8% of the decrease in register requirements with a quadratic worst-case time complexity, and the bounded 3*UP heuristic achieves 89.7% of the decrease with a linear time complexity.

To summarize, the stage scheduling approach proposed in this chapter allows a compiler to first search for an efficient modulo schedule, focusing exclusively on finding a schedule with the smallest feasible II and schedule length. Then a stage scheduling pass significantly reduces the register requirements without impact on the

performance of the schedule, when a bounded stage scheduling algorithm is used. The bounded 3*UP+RSS heuristic is clearly the best stage scheduling heuristic, decreasing the average register requirements by 18.4% compared to a register-insensitive modulo scheduler, and maintaining both the steady-state and transient performance of the original schedules. This heuristic should definitely be implemented when compiling for machines in which the register requirements of modulo scheduled loops are critical.

CHAPTER 5

Modulo Scheduling Algorithms

In the previous chapter, we focused on a modulo scheduling approach that distinguished between the MRT-scheduling and the stage-scheduling steps. While solving these two steps separately enabled us to develop efficient stage-scheduling algorithms, the generated schedules may not, in general, have maximum throughput (the II chosen by a modulo scheduling heuristic is not always the minimum possible) or minimum register requirements (since only one MRT is considered).

This chapter presents an optimal register-sensitive modulo scheduling algorithm that jointly searches for an MRT-schedule and a stage-schedule, finding a schedule with the highest steady-state throughput over all modulo schedules (minimum II), and the minimum register requirements among all such schedules. Our scheduling method can satisfy arbitrary dependence constraints for machines with finite resources and arbitrary reservation tables, and it minimizes MaxLive, the maximum number of live values at any single cycle of the loop schedule. When computing the register requirements of a schedule containing predicated operations, we assume that each predicate value is independent of the others. Further reducing the register requirements of schedules by investigating the relations among predicate values is postponed to Chapter 6.

The technique used in this chapter is based on an integer linear programming formulation and it capitalizes on recent advances in integer linear programming for scheduling problems. The contributions of this chapter are threefold.

First, we contribute a formulation of the resource constraints for machines with finite resources and arbitrary reservation tables. This formulation enables us, for example, to precisely handle the resource requirements of the Cydra 5 [11], a machine with complex resource requirements. Second, we contribute a precise formulation of the register requirements of a modulo schedule. By using this precise model, instead of minimizing some approximation of the actual register requirements, we find schedules with lower register requirements for a significant fraction of the 1327 loops in our benchmark suite (between 13.0% and 21.4%, depending on the approximation). Third, we propose a more efficient formulation of the scheduling constraints, which enables us to increase the number of loops successfully scheduled for minimum MaxLive from 781 to 920 loops.

In this chapter, the related work is presented in Section 5.1 and the impact of modulo scheduling on the register requirements is illustrated in Section 5.2. A modulo scheduling algorithm for minimum register requirements is developed in Section 5.3 and this algorithm is extended to find schedules with maximum throughput in Section 5.4. We demonstrate in Section 5.5 that the search space of register-sensitive modulo scheduling is bounded. Our experimental data is presented in Section 5.6 and our conclusions in Section 5.7.

5.1 Related Work

Searching for modulo schedules with the highest steady-state throughput over the entire space of modulo schedules was first formulated using an integer linear programming formulation by Hsu [48]. This formulation was obtained for arbitrary dependence graphs on machines with finite resources and simple resource reservation tables.

Ning and Gao proposed an algorithm based on a linear programming formulation that results in a schedule with the highest steady-state throughput over all modulo schedules, and minimum average lifetime among such schedules, for a machine with an unlimited number of resources [70].

5.1 Related Work Searching for modulo schedules with the highest steady-state throughput over the entire space of modulo schedules was rst formulated using an integer linear programming formulation by Hsu [48]. This formulation was obtained for arbitrary dependence graphs on machines with nite resources and simple resource reservation tables. Ning and Gao proposed an algorithm based on a linear programming formulation that results in a schedule with the highest steady-state throughput over all modulo schedules, and minimum average lifetime among such schedules for a machine with an unlimited number of resources [70]. In their work, and elsewhere [44][93], the register 162

requirements of a schedule are approximated by conceptual FIFO bu ers that are reserved for an interval of time that is a multiple of II . Their scheduling algorithm handles loop iterations with arbitrary dependence graphs. Ning and Gao's results were extended for machines with nite resources and fully-pipelined functional units by Govindarajan et al [44] using an algorithm based on an integer linear programming formulation that directly minimizes the bu er requirements of a schedule. Altman et al [7] have independently developed a formulation for machines with complex resource requirements. Their formulation simultaneously enforces the nite resource constraints and performs the mapping of operations to speci c functional units. For example [7], in a loop with 3 divide operations on a machine with 2 divide functional units, the mapping problem determines at compile time which divide operation is executed on which divide functional unit. Their approach formulates the joint nite resource and mapping problem as a coloring problem, using N 2 binary variables, N integer variables, and a number of constraints on the order of N 2 II jQj, where N is the number of operations in a loop iteration and Q is the set of machine resources. Their approach is very useful when scheduling for machines for which a mapping can not be trivially found, i.e. for machines that do not satisfy the mapping-free property de ned in Section 5.3.2. Otherwise, there is a class of mapping-free machines for which a mapping can be trivially found by unrolling the resulting schedule some nite number of times, and using a technique similar to modulo variable expansion [55]. For these machines, their approach does not require unrolling but introduces additional constraints on the scheduling space. Thus, their approach may result in schedules with more compact code (since no unrolling is required to nd a feasible mapping) and potentially lower steady-state performance (due to the additional constraints on the scheduling space that can be avoided by unrolling the resulting schedule). For machines where a trivial mapping can be found without unrolling, a formulation like the one proposed in this chapter should be used, since it is more concise (as it introduces no additional variables and uses II  jQj constraints), more ecient (since its constraints satisfy the structured property de ned in Section 5.3.4), and 163

handles the resource requirements of machines with arbitrary reservation tables. Altman [6] has independently derived a model that minimizes the register requirements without requiring loop unrolling by jointly scheduling operations and allocating lifetimes to physical registers. This result is obtained by performing as many copy operations as necessary for lifetimes that span several stages. His formulation is based on a coloring problem that uses a large number of additional variables and constraints. In this chapter, we handle the register requirements by using a formulation that may require unrolling the loop, but does not require any additional copy operation. The proposed formulation is concise (because it introduces only the minimum number of variables required to model the lifetimes of each operation in each cycle of the MRT) and is ecient (since the two types of constraints that are used to model the register requirements are structured). Ning and Gao's polynomial-time algorithm [70] has also been used by Dupont de Dinechin [29] to nd modulo schedules with low register requirements. He proposed a scheduling heuristic that solves a linear programming model to compute the ranges of cycles in which to schedule each operation, nds a contention free cycle for one operation, xes the scheduling time of that operation in the linear programming formulation, and continues with the next operation. In this work, he also proposed an ecient technique to compute RecMII , the minimum initiation interval due to recurrences in the dependence graph, as well as a lower bound on the initiation interval due to the register requirements. Although this algorithm uses a linear programming solver, it does not necessarily result in schedule with maximum throughput or minimum register requirements, since it uses a heuristic to nd a contention free schedule for a machine with nite resources.

164

5.2 Modulo Scheduling and Register Requirements In this section, we illustrate the impact of stage scheduling on the register requirements with an example for the target processor summarized in Table 3.1 with three general-purpose functional units and the register model described in Section 3.2 with an additional reserved time of one cycle.

Example 5.1 This example illustrates the impact of modulo scheduling on the reg-

ister requirements of a modulo-scheduled loop. This kernel is: y[i] = a and is computed as shown in the dependence graph of Figure 5.1a.

x[i]2- x[i]-

The scheduler places each operation of an iteration so that both resource constraints and dependence constraints are ful lled. Figure 5.1b illustrates a schedule with an II of 2 for the kernel of Example 5.1 on the target machine. In this schedule, the load, mult, add, sub and store operations of the iteration starting at time 0 are respectively scheduled at time 0, 1, 1, 5, and 6. The modulo reservation table (MRT) associated with a schedule is obtained by collapsing the schedule for an iteration to a table of II rows, using wraparound. Figure 5.1c illustrates the MRT associated with the schedule of Figure 5.1b. The resource constraints of a modulo schedule are satis ed if and only if the packing of the operations within the II rows of the MRT does not exceed the resources of the machine. For our target machine, the resource constraints allow up to 3 operations of any kind to be issued in each row of the MRT. The MRT can also model specialized resources by allowing only certain types of operations in each column. The virtual register lifetimes associated with this iteration are presented in Figure 5.1d. Recall that the black bars correspond to the initial portion of the lifetime which minimally satis es the latency of the de ning functional unit. The dependence constraints are satis ed if and only if no operation is placed during the portion of the lifetime that overlaps with the black bar of a virtual register that it uses. 165

[Figure 5.1: Modulo schedule with add scheduled early (Example 5.1). a) Dependence graph (ld, *, +, −, st); b) Schedule of one iteration (II=2, times 0 to 7); c) MRT (rows 0,2,4,6 and 1,3,5,7); d) Lifetimes of vr0 to vr3 (black: latency portion of the lifetime; white: additional lifetime); e) Register requirements (integral and fractional parts).]

[Figure 5.2: Modulo schedule with add scheduled late (Example 5.1). a) Schedule of one iteration (II=2); b) MRT; c) Lifetimes of vr0 to vr3; d) Register requirements (integral and fractional parts).]

Formal Problem Definition. Among all modulo schedules that employ the smallest feasible initiation interval and satisfy all resource constraints, find one that minimizes MaxLive.

The schedule presented in Figure 5.1b results in the minimum buffer requirements, or integral MaxLive, for this kernel and target machine. However, minimizing buffer requirements or integral MaxLive does not in general result in a schedule with minimal MaxLive, because the fractional MaxLive may be excessive. For example, Figure 5.2 presents a schedule for the same kernel that results in the same integral MaxLive but a lower total MaxLive. The schedule of Figure 5.2a differs only in the schedule time of the add operation, which is delayed by one cycle from time 1 to time 2. The register requirements associated with this schedule are shown in Figure 5.2d. Comparing the register requirements of this schedule to those of Figure 5.1e, we see that both have an integral MaxLive of four. However, this schedule results in a lower fractional MaxLive, as the fractional parts are more evenly distributed among the II rows. As a result, although the buffer requirement is eight in both cases, MaxLive decreases from eight to seven.
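The mechanism behind this comparison can be sketched in a few lines of Python; the lifetime endpoints below are illustrative assumptions rather than the exact values of Figures 5.1 and 5.2, but they show how delaying a single operation by one cycle can lower MaxLive while leaving the schedule otherwise unchanged:

    # Count live virtual registers per MRT row from half-open lifetimes
    # [start, end); every cycle of a lifetime contributes one live
    # instance (from some iteration) to the row (cycle mod II).
    def max_live(lifetimes, II):
        live = [0] * II
        for start, end in lifetimes:
            for t in range(start, end):
                live[t % II] += 1
        return max(live), live

    # Assumed lifetimes for vr0..vr3; in "late", the add's lifetime
    # starts one cycle later, as in Figure 5.2.
    early = [(0, 2), (1, 5), (1, 5), (5, 6)]
    late  = [(0, 2), (1, 5), (2, 5), (5, 6)]

    print(max_live(early, II=2))  # (6, [5, 6])
    print(max_live(late,  II=2))  # (5, [5, 5])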

5.3 Scheduling for Minimum MaxLive

In this section, we present an integer linear programming formulation that efficiently schedules the operations of an innermost loop for machines with finite resource requirements while minimizing the register requirements associated with the schedule.

In the first part of this section, we describe the modulo schedule search space for machines with finite resources and arbitrary reservation tables. We present here three types of constraints that must be satisfied by valid modulo schedules. The first type corresponds to the assignment constraints, which insure that each operation in the loop is assigned to exactly one cycle in the schedule. The formulation of the assignment constraints is presented in Section 5.3.1 using the formulation proposed by Govindarajan et al [44]. The second type represents the resource constraints, which insure that no cycle of the schedule requires more resources than are available in the machine.

  Machine (input):
    Q                set of resource types.
    M_q              number of resources of type q.
    Res_{i,q}        set of cycles in which operation i uses resource q.
    l_{i,j}          latency between operation i and operation j.
  Graph (input):
    G = {V, E_sched, E_reg}
                     dependence graph with scheduling edge set and register data flow edge set.
    ω_{i,j}          dependence distance between operation i and operation j.
    N                number of operations (vertices) in graph G.
  Schedule (input):
    II               initiation interval (input constant in the integer programming model).
  Results (output):
    a_{r,i}          binary matrix where a_{r,i} = 1 if operation i is scheduled in row r.
    k_i              stage number of operation i.
    v_{r,i}          number of live virtual registers in row r generated by operation i.
    MaxLive          maximum number of live virtual registers among all rows.

Table 5.1: Variables for the modulo scheduling model.

The formulation of the resource constraints for machines with arbitrary reservation tables is developed in Section 5.3.2 and is original to our work. The third type corresponds to the scheduling constraints, which enforce arbitrary scheduling dependences between pairs of operations. A traditional formulation of the scheduling constraints is shown in Section 5.3.3 using the formulation devised by Govindarajan et al [44]. A novel and more structured formulation of the loop scheduling constraints is developed in Section 5.3.4.

In the latter part of this section, we describe three objective functions that account for the register requirements or some approximation of the register requirements. The first two objective functions, presented in Sections 5.3.5 and 5.3.6, compute the average lifetime and the buffer requirements associated with a schedule, respectively, and are similar to objective functions found in previous work [70][44][29][93].

The third objective function computes MaxLive, the maximum number of registers required at any single cycle of the schedule. This objective function is presented in Section 5.3.7 and is also original to our work.

We use in this section the notation described in Section 3.5 to represent a loop iteration, its dependence graph, and the machine latencies. In particular, the dependence graph is represented by G = {V, E_sched, E_reg}, where the edge sets E_sched and E_reg correspond, respectively, to the scheduling dependences and the register dependences among operations. Moreover, an edge from operation i to operation j, w iterations later, is associated with a latency l_{i,j} and a dependence distance ω_{i,j} = w. All variables used here are summarized in Table 5.1.
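As an aside, the edge annotations can be represented very simply; the following Python sketch (types, operation numbering, and latency values are our own illustration, loosely based on Example 5.1) shows the information each scheduling edge carries:

    # A minimal sketch of a scheduling edge: a latency l_{i,j} and a
    # dependence distance omega_{i,j} (in iterations).
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Edge:
        src: int      # producing operation i
        dst: int      # consuming operation j
        latency: int  # l_{i,j}
        omega: int    # dependence distance omega_{i,j}

    # Edges of Example 5.1 (ld=0, mult=1, add=2, sub=3, st=4);
    # the latencies are assumed values consistent with Figure 5.1b.
    E_sched = [Edge(0, 1, 1, 0),  # ld -> mult
               Edge(0, 2, 1, 0),  # ld -> add
               Edge(1, 3, 4, 0),  # mult -> sub
               Edge(2, 3, 3, 0),  # add -> sub
               Edge(3, 4, 1, 0)]  # sub -> st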

5.3.1 Assignment Constraints

Consider a loop with N operations and an initiation interval of II. We represent a schedule for this loop by an II × N binary matrix, called A, where a_{r,i} = 1 if and only if operation i is scheduled in row r of the MRT, and 0 otherwise. The first condition that a valid modulo schedule must satisfy is that each operation is scheduled exactly once per iteration:

    \sum_{r=0}^{II-1} a_{r,i} = 1 \qquad \forall i \in [0, N)    (5.1)

Equation (5.1) defines all the assignment constraints, i.e. the constraints that assign each operation to exactly one row of the MRT.
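As a sketch of how these constraints materialize (our own illustration; no particular ILP library is assumed, and the rows can be handed to any 0-1 solver), the assignment constraints expand into one equality per operation over the II·N binary variables a_{r,i}, flattened here as column r·N + i:

    # Assignment constraints (5.1): each operation is scheduled in
    # exactly one MRT row.
    def assignment_constraints(II, N):
        rows = []
        for i in range(N):
            coeff = [0] * (II * N)
            for r in range(II):
                coeff[r * N + i] = 1  # sum a_{r,i} over all rows r
            rows.append((coeff, "==", 1))
        return rows

    rows = assignment_constraints(II=2, N=5)  # Example 5.1: 5 operations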

5.3.2 Mapping-Free Resource Constraints

The second condition that a valid modulo schedule must satisfy is that no cycle of the schedule consumes more resources than are available in the machine. Consider an operation i which uses resource q for one cycle exactly c cycles after being issued (referred to as a usage of resource q in cycle c in Chapter 2). Operation i utilizes resource q in row r if operation i is scheduled in row (r − c) mod II. Using the binary matrix A, we know that resource q is reserved in row r if a_{(r−c) mod II, i} = 1. To account for all the usages of resource q in a given row, we simply need to sum the appropriate a variables for each resource usage of q in each operation. For a machine with M_q machine resources of type q, the resource constraint for row r and resource q simply states that no more than M_q resources of type q can be concurrently reserved among all operations, namely:

    \sum_{i=0}^{N-1} \sum_{c \in Res_{i,q}} a_{(r-c) \bmod II,\, i} \le M_q \qquad \forall q \in Q,\ \forall r \in [0, II)    (5.2)

where Q is the set of resource types and c ∈ Res_{i,q} indicates that operation i uses a resource of type q exactly c cycles after being issued. Inequality (5.2) precisely models the resource requirements in each cycle, and any assignment, A, that satisfies Inequality (5.2) has no resource conflicts, provided that the machine satisfies the following property.
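Continuing the sketch above (same flattened variable layout; the reservation tables here are illustrative assumptions), the resource constraints expand into one inequality per (resource type, MRT row) pair:

    # Resource constraints (5.2): for each resource type q and row r,
    # sum a_{(r-c) mod II, i} over all usage cycles c in Res_{i,q}.
    def resource_constraints(II, N, Res, M):
        rows = []
        for q, cap in M.items():
            for r in range(II):
                coeff = [0] * (II * N)
                for i in range(N):
                    for c in Res.get(i, {}).get(q, []):
                        coeff[((r - c) % II) * N + i] += 1
                rows.append((coeff, "<=", cap))
        return rows

    # Example 5.1's machine: one "issue" resource with 3 instances,
    # used by every operation in its issue cycle (cycle 0).
    Res = {i: {"issue": [0]} for i in range(5)}
    rows = resource_constraints(II=2, N=5, Res=Res, M={"issue": 3})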

Definition 5.1 (Mapping-Free Property) A machine is said to be mapping-free if, for every schedule that satisfies resource constraint (5.2), there is a mapping from each operation's resource usages to resource instances that is consistent with the hardware requirements of the machine.

We describe here three classes of machines that satisfy this property. First, consider a machine with no more than one instance of each resource, i.e. M_q ≤ 1 for each q ∈ Q. Such a machine clearly satisfies the mapping-free property because, since there is only one instance to choose from, the mapping from an operation's resource usages to resource instances is trivial.

Second, consider a machine where, for each resource q ∈ Q with M_q > 1, each operation that uses q can arbitrarily use any instance of q, in each cycle in which the operation reserves one or more instances of q. This machine also satisfies the mapping-free property because there are no specific hardware requirements that restrict the mapping from an operation's resource usages to resource instances.

Third, consider a machine for which there is a string constraint that requires certain uses of a resource by some operations that reserve that resource for several consecutive cycles to use the same instance of that resource for that entire string of cycles; however, every other resource usage over all operations can be mapped freely to any instance of the resource used. Note that we are guaranteed to be able to find a legal mapping from each operation's strings to resource instances by using the algorithm presented in the proof of Theorem 3.1 and unrolling the loop some finite number of times. Note that Theorem 3.1 applies since this mapping problem is isomorphic to the problem of mapping virtual registers (each of which has a lifetime that is a string of consecutive cycles and must be assigned to the same physical register for its entire lifetime) to physical registers. After the strings are mapped, mapping the remaining resource usages is unrestricted and therefore trivial. Thus this machine satisfies the mapping-free property as well. Note that, since the definition of this third machine includes all machines with some mix of fully pipelined and non-pipelined execution units as defined in [6], these machines constitute an important and common class.

Thus, for these three classes of machines, and for any other machines with the mapping-free property, Inequality (5.2) insures that there are no resource conflicts and that a legal mapping to resource instances can be found.

There are, however, machines that do not satisfy the mapping-free property. For example, consider a machine with multiple instances of resource q (i.e. M_q > 1) that is used by an operation in two (or more) nonconsecutive cycles, without using that resource in the intervening cycles, and where the same instance of q must be used in both these cycles. Since these two usages are coupled, we must not only guarantee that no more than M_q resources are used in each cycle but also ensure that there is a legal mapping from the operation's resource usages of q to instances of q.

We can handle these machines in one of the following three ways. One way is to use the alternative operations approach introduced in Chapter 2 and decide before scheduling which operations are mapped to which alternative, and which usages of a problematic resource are mapped to which instance of that resource. Thereby we would essentially be considering each instance of a problematic resource to be a different resource, so as to create a machine with only a single instance of each problematic resource. Finding an optimal solution may then, however, require solving many scheduling problems that differ in the alternatives chosen. A second way is to "fill in" such coupled nonconsecutive cycle usages with usages in the intervening cycles, thereby holding the resource throughout the intervening cycles, rendering it unusable by other operations during this period, and accepting the possibly nonoptimal schedules that result. However, note that if each string formed by such added usages is contained within the usage pattern of the corresponding maximal resource, as defined in Step 2 of Section 2.2, then the resulting schedule will still be optimal. A third way is to use the formulation recently proposed by Altman et al [7], which handles both the scheduling and the mapping problems simultaneously.
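To illustrate the second (fill-in) approach, a minimal sketch, assuming a reservation table encoded as a list of usage cycles:

    # "Fill in" coupled nonconsecutive usages of a resource: if an
    # operation must use the same instance in, say, cycles 0 and 3,
    # also reserve the intervening cycles 1 and 2, so that any
    # per-cycle-feasible schedule has a trivial instance mapping
    # (at some possible cost in schedule quality).
    def fill_in(usage_cycles):
        lo, hi = min(usage_cycles), max(usage_cycles)
        return list(range(lo, hi + 1))

    print(fill_in([0, 3]))  # [0, 1, 2, 3]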

5.3.3 Simple Formulation of the Scheduling Constraints

The third condition that a valid schedule must satisfy is that each scheduling dependence present in its dependence graph is enforced. We now introduce two derived parameters that characterize the MRT row and the time at which each operation is scheduled:

    row_i    row number in which operation i is scheduled (from 0 to II − 1).
    time_i   time at which operation i is scheduled.

Using matrix A, we may write the row number in which operation i is scheduled as:

    row_i = \sum_{r=1}^{II-1} r \cdot a_{r,i}    (5.3)

A modulo schedule also defines the stage in which each operation is scheduled. We represent the stage numbers by k, an integer vector of dimension N, where k_i is the stage number in which operation i is scheduled. Matrix A and vector k uniquely define the cycle in which each operation is scheduled. The schedule time of operation i is equal to:

    time_i = k_i \cdot II + row_i    (5.4)

A modulo schedule must enforce all the scheduling dependences of its dependence graph. A dependence between operation i and operation j, ω_{i,j} iterations later, is fulfilled if operation j is scheduled at least l_{i,j} cycles after operation i:

    (time_j + \omega_{i,j} \cdot II) - time_i \ge l_{i,j} \qquad \forall (i,j) \in E_{sched}    (5.5)

Substituting Equations (5.3) and (5.4) into Inequality (5.5) results in the following inequality:

    \sum_{r=1}^{II-1} r \cdot (a_{r,j} - a_{r,i}) + (k_j - k_i) \cdot II \ge l_{i,j} - \omega_{i,j} \cdot II \qquad \forall (i,j) \in E_{sched}    (5.6)

Inequality (5.6) defines all the scheduling constraints of a modulo schedule for a given initiation interval II with respect to the dependence distances ω_{i,j} and dependence latencies l_{i,j} of a dependence graph G.
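To connect Inequality (5.6) back to the schedule of Figure 5.1b, the following check (reusing the earlier sketches; the latency value is an assumption) verifies one scheduling edge:

    # Check scheduling constraint (5.6) for edge (i, j), given the
    # binary assignment matrix a[r][i] and the stage vector k.
    def edge_satisfied(a, k, II, i, j, latency, omega):
        lhs = sum(r * (a[r][j] - a[r][i]) for r in range(1, II))
        lhs += (k[j] - k[i]) * II
        return lhs >= latency - omega * II

    # Figure 5.1b (II = 2): rows ld=0, mult=1, add=1, sub=1, st=0;
    # stages 0, 0, 0, 2, 3.
    a = [[1, 0, 0, 0, 1],   # row 0
         [0, 1, 1, 1, 0]]   # row 1
    k = [0, 0, 0, 2, 3]
    print(edge_satisfied(a, k, II=2, i=1, j=3, latency=4, omega=0))  # True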

5.3.4 Efficient Formulation of the Scheduling Constraints

Solving an integer linear programming model is generally implemented by iteratively solving a linear programming model where additional constraints are introduced (and removed) to force each integer variable to an integer value without omitting from the solution space any optimal (integer) solution [69]. A branch-and-bound algorithm can be used to determine which parts of the search space to consider, and each branch-and-bound node is generally evaluated by solving a linear programming model with the original constraints augmented by some additional constraints that force variables to integer values. A key aspect of formulating an efficient integer program is to find a formulation that results in fewer branch-and-bound nodes, a goal that can be achieved in part by structuring the problem so that the linear programming solver naturally produces an integer solution for as many integer variables as possible. One beneficial property of a problem formulation is defined here:

Definition 5.2 (Structured Constraints) A constraint is defined as structured if each variable appears at most once, multiplied by either a +1, 0, or −1 constant coefficient. By extension, a formulation is defined as structured if each of its constraints is structured.
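For instance (an illustration of ours, not a constraint from the model), the first constraint below is structured while the second is not, because x carries a coefficient other than +1, 0, or −1:

    x - y + z \le 3 \quad \text{(structured)} \qquad\qquad 2x + y \le 3 \quad \text{(not structured)}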

Note that the assignment and the resource constraints do satisfy this property, whereas the scheduling constraints formulated in Inequality (5.6) do not, since the elements of the k vector are multiplied by II and the elements of the binary A matrix are multiplied by r ∈ [0, II). In this section, we reformulate the scheduling constraints so that they are structured, thus resulting in a more efficient representation of the modulo scheduling problem. The basic idea for this reformulation is due to Chaudhuri et al [23], who used such a reformulation for straight-line (nonloop, nonbranching) code. The adaptation of this idea to modulo schedules for loop code is, however, not straightforward and substantially different in detail, as is the proof of the validity of this adaptation. Experimental evidence presented in Section 5.6.1 will indicate the benefit of using the structured formulation; for example, using the constraints presented in this section directly resulted in an average decrease in visited branch-and-bound nodes by a factor of 128 for the formulation that minimizes MaxLive.

We derive a structured formulation of the scheduling constraint associated with scheduling edge (i, j) as follows. First, we assume that we know the scheduling time of operation i (i.e. time_i is known and constant) and derive a structured constraint that precisely determines the scheduling times for operation j that satisfy scheduling edge (i, j). Second, we show how to extend this result without assuming the value of time_i to be known and constant.

Case with known $time_i$

Recall that a scheduling constraint associated with scheduling edge $(i,j)$ enforces the scheduling dependence between operation $i$ and operation $j$, $\omega_{i,j}$ iterations later, and was defined in Inequality (5.5) as $time_j + \omega_{i,j} \cdot II - time_i \ge l_{i,j}$. Because each term in the scheduling constraint has an integer value, we may reformulate this inequality as a strict inequality, i.e. $time_j + \omega_{i,j} \cdot II - time_i > l_{i,j} - 1$. We may thus write:

$$time_j + \omega_{i,j} \cdot II > time_i + l_{i,j} - 1 \qquad (5.7)$$

The left hand side of Inequality (5.7) corresponds to the time at which the value produced by operation $i$ can legally be used by operation $j$ ($\omega_{i,j}$ iterations later). For conciseness, we refer to this value as $time_u$ (for the time of use) in the remainder of this section. Using the relations $time_j = k_j \cdot II + row_j$ and $time_u = k_u \cdot II + row_u$, we may thus define the stage and row of $time_u$ as, respectively,

$$k_u = k_j + \omega_{i,j} \qquad row_u = row_j \qquad (5.8)$$

Similarly, the right hand side of Inequality (5.7) corresponds to the latest time at which the value produced by $i$ is forbidden from use. We refer to this value as $time_f$ (for the last forbidden time) in the remainder of this section. Using the relations $time_i = k_i \cdot II + row_i$ and $time_f = k_f \cdot II + row_f$, we may thus define the stage and row of $time_f$ as, respectively,

$$k_f = k_i + \left\lfloor \frac{row_i + l_{i,j} - 1}{II} \right\rfloor \qquad row_f = (row_i + l_{i,j} - 1) \bmod II \qquad (5.9)$$

Note that since we assume here that the value of $time_i$ is known and constant, by extension, the values of $k_f$ and $row_f$ are known and constant as well. The new definitions are summarized here for later reference.


$time_u$: time at which the value produced by operation $i$ can legally be used by operation $j$ ($\omega_{i,j}$ iterations later).
$time_f$: latest time at which the value produced by $i$ is forbidden from use.

Using the definitions from Equations (5.8) and (5.9), we may write the scheduling constraint expressed in Inequality (5.7) as:

$$k_u \cdot II + row_u > k_f \cdot II + row_f \qquad (5.10)$$

We may transform Inequality (5.10) to isolate the two stage numbers:

$$k_u - k_f > \frac{row_f - row_u}{II} \qquad (5.11)$$

Interestingly, we can show that the row difference $row_f - row_u$ has values in the range $(-II, II)$ since rows, by definition, have values in the range $[0, II)$. Consequently, the right hand side of Inequality (5.11) has values in the range $(-1, 1)$. Therefore, we can guarantee that when the integer-valued stage difference $k_u - k_f$ is 1 (or larger), the scheduling constraint is satisfied. Similarly, we can guarantee that when the stage difference $k_u - k_f$ is $-1$ (or smaller), the scheduling constraint is violated. We may thus write:

$$(k_u > k_f) \Rightarrow \text{scheduling constraint is satisfied} \qquad (5.12)$$
$$(k_u < k_f) \Rightarrow \text{scheduling constraint is violated} \qquad (5.13)$$

We are thus able to determine the scheduling times of operation $j$ that satisfy the scheduling constraint when $k_u \ne k_f$. Otherwise $k_f = k_u$, i.e. both the use time of the value produced by operation $i$ and the latest forbidden time for that value occur in the same stage of the schedule, and thus the specific values of $row_u$ and $row_f$ must be taken into account to determine whether $time_j$ is feasible.

To evaluate this case, let $k_f = k_u$ and substitute $k_f$ for $k_u$ in Inequality (5.10), obtaining the following scheduling constraint:

$$row_u > row_f \qquad (5.14)$$

Inequality (5.14) states that, since operation $i$ is scheduled in $row_i$, its value is only available after $row_f$, and thus if operation $j$ is scheduled in the same stage as $time_f$, operation $j$ must be assigned to a row that is strictly larger than $row_f$. Reformulating Inequality (5.14) using the binary $a_{r,j}$ variables, we obtain:

$$z = \sum_{x=0}^{row_f} a_{x,j} = 0 \qquad (5.15)$$

where the value of the sum in Equation (5.15) is referred to as $z$. Because of the assignment constraints, we know that only one $a_{r,j}$ variable is equal to 1, and all other $a_{r,j}$ variables are equal to 0. Thus, we know that $z$ is either 0 or 1, depending on the row in which operation $j$ is scheduled. Consequently, we can clearly see that Equation (5.15) is satisfied if and only if operation $j$ is not assigned to any of the rows in the range $[0, row_f]$. As a result, Inequality (5.14) and Equation (5.15) are equivalent. We may thus write:

$$(k_u = k_f) \;\&\; (z = 0) \Rightarrow \text{scheduling constraint is satisfied} \qquad (5.16)$$
$$(k_u = k_f) \;\&\; (z \ne 0) \Rightarrow \text{scheduling constraint is violated} \qquad (5.17)$$

To summarize our findings, we have shown that operation $j$ satisfies the scheduling edge $(i,j)$ if either Relation (5.12) is satisfied (i.e. $k_u > k_f$) or Relation (5.16) is satisfied (i.e. $k_u = k_f$ and $z = 0$). Otherwise, we have shown by Relations (5.13) and (5.17) that the scheduling constraint is violated. We may now combine the two disjoint Relations (5.12) and (5.16) by formulating the following constraint:

$$k_f - k_u + z \le 0 \qquad (5.18)$$

Inequality (5.18) is equivalent to the union of Relations (5.12) and (5.16) because when $k_u > k_f$, Inequality (5.18) holds regardless of the value of $z$ since $z \in [0, 1]$, and when $k_u = k_f$, Inequality (5.18) holds precisely when $z = 0$. Using the definitions in Equations (5.8), (5.9), and (5.15) in Inequality (5.18), we obtain the following constraint:

$$k_i + \left\lfloor \frac{row_i + l_{i,j} - 1}{II} \right\rfloor - k_j - \omega_{i,j} + \sum_{x=0}^{(row_i + l_{i,j} - 1) \bmod II} a_{x,j} \le 0 \qquad (5.19)$$

Consequently, we have derived a constraint that determines, for a given $time_i$ (expressed in terms of $k_i$ and $row_i$), the scheduling times for operation $j$ (expressed in terms of $k_j$ and $a_{x,j}$) that satisfy the scheduling edge $(i,j)$. This constraint is structured in that its variables (i.e. $k_j$ and $a_{x,j}$) appear only once and are only multiplied by $+1$, $0$, and $-1$ constant coefficients.
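As a sanity check, the following brute-force sketch in Python (the sampling ranges are arbitrary) verifies that, for a fixed $time_i$, the structured constraint (5.19) accepts exactly the times for operation $j$ that the original dependence constraint (5.5) accepts:

def satisfies_5_5(time_i, time_j, omega, latency, II):
    # Original scheduling constraint: (time_j + omega*II) - time_i >= latency.
    return time_j + omega * II - time_i >= latency

def satisfies_5_19(time_i, time_j, omega, latency, II):
    # Structured constraint (5.19), with k = time // II and row = time % II.
    k_i, row_i = divmod(time_i, II)
    k_j, row_j = divmod(time_j, II)
    row_f = (row_i + latency - 1) % II
    # Only a[row_j][j] equals 1, so the sum of a[x][j] over x in [0, row_f] is:
    z = 1 if row_j <= row_f else 0
    return k_i + (row_i + latency - 1) // II - k_j - omega + z <= 0

II, omega, latency = 4, 1, 3
for time_i in range(2 * II):
    for time_j in range(4 * II):
        assert satisfies_5_5(time_i, time_j, omega, latency, II) == \
               satisfies_5_19(time_i, time_j, omega, latency, II)
print("(5.19) agrees with (5.5) for all sampled schedule times")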

Case with unknown $time_i$

In this case, we do not assume to know the time at which operation $i$ is scheduled. Our claim is that the following inequality:

$$k_i + \left\lfloor \frac{r + l_{i,j} - 1}{II} \right\rfloor - k_j - \omega_{i,j} + \sum_{x=0}^{(r + l_{i,j} - 1) \bmod II} a_{x,j} \le 1 - a_{r,i} \qquad \forall r \in [0, II) \qquad (5.20)$$

is equivalent to Inequality (5.19) when $r = row_i$ and is trivially satisfied otherwise.

First, we show that Inequalities (5.19) and (5.20) are equivalent when $r = row_i$. Since $r = row_i$ is assumed, we can substitute $row_i$ for $r$ in Inequality (5.20), obtaining the same left hand side as in Inequality (5.19). Moreover, after the substitution step, the right hand side of Inequality (5.20) is $(1 - a_{row_i,i})$. Because operation $i$ is scheduled in $row_i$, we know that $a_{row_i,i} = 1$, and thus the right hand side of Inequality (5.20) is also 0. Thus, Inequalities (5.19) and (5.20) are identical when $r = row_i$.

Second, we must show that Inequality (5.20) is trivially satisfied when $r \ne row_i$. Since $r \ne row_i$, we know that $a_{r,i} = 0$, and thus we may rewrite the $r \ne row_i$ cases

of Inequality (5.20) as follows:

$$k_i + \left\lfloor \frac{r + l_{i,j} - 1}{II} \right\rfloor - k_j - \omega_{i,j} + \sum_{x=0}^{(r + l_{i,j} - 1) \bmod II} a_{x,j} \le 1 \qquad \forall r \in [0, II),\; r \ne row_i \qquad (5.21)$$

We must now show that Inequality (5.21) is always less constraining than Inequality (5.19). Comparing these two inequalities, the condition under which Inequality (5.21) holds without introducing additional constraints is:

$$\left\lfloor \frac{r + l_{i,j} - 1}{II} \right\rfloor - \left\lfloor \frac{row_i + l_{i,j} - 1}{II} \right\rfloor + \sum_{x=0}^{(r + l_{i,j} - 1) \bmod II} a_{x,j} - \sum_{x=0}^{(row_i + l_{i,j} - 1) \bmod II} a_{x,j} \le 1 \qquad \forall r \in [0, II),\; r \ne row_i \qquad (5.22)$$

Since $r \ne row_i$, we must consider two cases: $0 \le r < row_i < II$ and $0 \le row_i < r < II$. In the first case, we know that the difference of the two floor functions in Inequality (5.22) is either 0 or $-1$. Moreover, by the assignment constraints, each of the summations in Inequality (5.22) is either 0 or 1. Thus the left hand side of Inequality (5.22) is in the range $[-2, 1]$. Inequality (5.22) is thus trivially satisfied and introduces no new constraints if $r < row_i$.

In the second case, $r > row_i$, and the difference of the two floor functions in Inequality (5.22) is either 0 or 1. (a) If this difference is 0, we note that since the difference of the two summations is either $-1$, 0, or 1, Inequality (5.22) is trivially satisfied and introduces no additional constraints. (b) If the difference of the two floor functions is 1, i.e. the integer part of $(r + l_{i,j} - 1)/II$ is greater than the integer part of $(row_i + l_{i,j} - 1)/II$, it follows that, since $r - row_i < II$, the fractional part of $(r + l_{i,j} - 1)/II$ is less than the fractional part of $(row_i + l_{i,j} - 1)/II$, i.e. $(r + l_{i,j} - 1) \bmod II < (row_i + l_{i,j} - 1) \bmod II$. The set of the terms in the first summation is thus a subset of the terms in the second summation. The difference between the two summations in Inequality (5.22) is thus either 0 or $-1$. Therefore, once again, Inequality (5.22) is trivially satisfied and introduces no new constraints.

As a result, Inequality (5.22) is always trivially satisfied, as is Inequality (5.21), and the only constraining case of Inequality (5.20) is $r = row_i$, which is equivalent to Inequality (5.19).

Final Form of the Structured Scheduling Constraints

Since we have just shown that Inequality (5.20) is equivalent to Inequality (5.19) when $r = row_i$ and is trivially satisfied otherwise, we can simply state that the scheduling constraints of all the scheduling edges are satisfied when:

$$k_i + \left\lfloor \frac{r + l_{i,j} - 1}{II} \right\rfloor - k_j - \omega_{i,j} + \sum_{x=0}^{(r + l_{i,j} - 1) \bmod II} a_{x,j} \le 1 - a_{r,i} \qquad \forall r \in [0, II),\; \forall (i,j) \in E_{sched} \qquad (5.23)$$

Chaudhuri et al [23] make the following observation for two dependent operations in straight-line (nonloop, nonbranching) code: consider operation $i$, with latency $l$, that produces a value used by operation $j$. When operation $i$ is assigned to cycle $t$, or any subsequent cycle, operation $j$ must be assigned to a cycle $t' \ge t + l$. Using a similar observation here, we may replace $a_{r,i}$ in the right hand side of Inequality (5.23) by the sum of the $a_{x,i}$ variables over $x \in [r, II)$. Thus, we obtain the following final form of the structured scheduling constraints:


$$\sum_{x=r}^{II-1} a_{x,i} + \sum_{x=0}^{(r + l_{i,j} - 1) \bmod II} a_{x,j} + k_i - k_j \le \omega_{i,j} - \left\lfloor \frac{r + l_{i,j} - 1}{II} \right\rfloor + 1 \qquad \forall r \in [0, II),\; \forall (i,j) \in E_{sched} \qquad (5.24)$$

At first, it may appear counterintuitive to replace Inequality (5.6) with Inequality (5.24) in order to obtain a more efficient formulation of the problem, since it corresponds to replacing each constraint with $II$ new constraints. The crucial point, however, is that the new constraints are structured, since each variable, namely each $k$ and each element of $A$, appears at most once and is multiplied by only $+1$, $0$, or $-1$ coefficients.
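To illustrate the structured property, the short sketch below (in Python, with a hypothetical variable naming scheme) materializes the $II$ constraints of Inequality (5.24) for a single scheduling edge as coefficient maps over the variables $a_{x,op}$ and $k_{op}$; every variable appears at most once, with a $+1$ or $-1$ coefficient, as required by Definition 5.2:

def structured_edge_constraints(i, j, latency, omega, II):
    """Yield (coeffs, rhs) pairs meaning sum(coeffs[var] * var) <= rhs.

    Assumes i != j; variables are named ("a", row, op) and ("k", op).
    """
    for r in range(II):
        coeffs = {}
        for x in range(r, II):                         # sum of a[x][i], x in [r, II)
            coeffs[("a", x, i)] = 1
        for x in range(((r + latency - 1) % II) + 1):  # sum of a[x][j]
            coeffs[("a", x, j)] = 1
        coeffs[("k", i)] = 1
        coeffs[("k", j)] = -1
        rhs = omega - (r + latency - 1) // II + 1
        yield coeffs, rhs

# One edge with latency 3, distance 0, and II = 4: each variable appears at
# most once with a +1/-1 coefficient, the structured property of Definition 5.2.
for coeffs, rhs in structured_edge_constraints(0, 1, 3, 0, 4):
    print(sorted(coeffs.items()), "<=", rhs)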

5.3.5 Average Lifetime Requirements

We have now completed the constraints that define the modulo scheduling space and continue with the presentation of three distinct objective functions. We have seen in Section 3.2 that the average lifetime requirements correspond to the sum of all the lifetimes divided by $II$. We represent the lifetimes here by $f$, an integer vector of dimension $N$, where $f_i$ is the number of cycles in which the value produced by operation $i$ is live. In our machine model, a lifetime begins in the cycle in which its def-operation is scheduled and lasts through the cycle in which its last-use operation is scheduled. Thus, we may define the lifetime $f_i$ as follows:

$$(time_j + \omega_{i,j} \cdot II) - time_i + 1 \le f_i \qquad \forall (i,j) \in E_{reg} \qquad (5.25)$$

where $time_i$ is the cycle in which the def-operation $i$ is scheduled and $time_j + \omega_{i,j} \cdot II$ is the cycle in which the use-operation $j$, $\omega_{i,j}$ iterations later, uses that value. By substituting Equations (5.3) and (5.4) into Inequality (5.25) we obtain the following


inequality:

$$\sum_{r=1}^{II-1} r \cdot (a_{r,j} - a_{r,i}) + (k_j - k_i) \cdot II + \omega_{i,j} \cdot II + 1 \le f_i \qquad \forall (i,j) \in E_{reg} \qquad (5.26)$$

Note that since, for example, the variables $k_i$ and $k_j$ are multiplied by $II$, and the variables $a_{r,i}$ and $a_{r,j}$ are multiplied by $r \in [1, II)$, Inequality (5.26) is not structured. The average lifetime is simply the sum of the $f_i$ for each operation in the loop, divided by $II$:

$$AvgLive = \frac{1}{II} \sum_{i=0}^{N-1} f_i \qquad (5.27)$$

Since multiplying (or dividing) an objective function by a strictly positive number is guaranteed to result in an equivalent solution (i.e. a solution with the same objective function value, once multiplied by that number), we may simply ignore the $1/II$ factor in Equation (5.27) when searching for a schedule with minimum AvgLive. After the $1/II$ factor is removed from Equation (5.27), the formulation is identical to the one proposed by Dupont de Dinechin [29], for example, except for the presence of the $+1$ term (which is particular to our virtual register model) in Inequality (5.26).
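As a small worked example (a sketch with a hypothetical encoding, computing lifetimes directly from a finished schedule rather than as ILP variables), the lifetimes of Inequality (5.25) and the AvgLive of Equation (5.27) can be evaluated as follows:

def lifetimes(time, reg_edges, II):
    """f[i] = max over uses j of (time[j] + omega*II) - time[i] + 1, per (5.25)."""
    f = {}
    for i, j, omega in reg_edges:
        f[i] = max(f.get(i, 0), time[j] + omega * II - time[i] + 1)
    return f

time = {0: 0, 1: 2, 2: 5}                        # operation -> schedule cycle
reg_edges = [(0, 1, 0), (0, 2, 1), (1, 2, 0)]    # (def, use, distance omega)
II = 3
f = lifetimes(time, reg_edges, II)
print(f, sum(f.values()) / II)                   # {0: 9, 1: 4} 4.333...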

5.3.6 Buffer Requirements

As indicated in Section 3.2, buffers must be reserved only for intervals that are integer multiples of $II$ cycles. We represent here the buffer requirements by $b$, an integer vector of dimension $N$, where $b_i$ corresponds to the number of buffers that are associated with operation $i$. The buffer requirement resulting from register edge $(i,j)$ is defined as:

$$(time_j + \omega_{i,j} \cdot II) - time_i + 1 \le b_i \cdot II \qquad (5.28)$$

where, unlike $f_i$ in Inequalities (5.25) and (5.26), the variable $b_i$ is multiplied by $II$ since it must cover an interval that is a multiple of the initiation interval. By substituting Equations (5.3) and (5.4) into Inequality (5.28) we obtain the following inequality:

$$\sum_{r=1}^{II-1} r \cdot (a_{r,j} - a_{r,i}) + (k_j - k_i) \cdot II + \omega_{i,j} \cdot II + 1 \le b_i \cdot II \qquad \forall (i,j) \in E_{reg} \qquad (5.29)$$

where the variables $k_i$, $k_j$, and $b_i$ are all multiplied by $II$, and the variables $a_{r,i}$ and $a_{r,j}$ are multiplied by $r \in [1, II)$. The total buffer requirement is then simply equal to the sum of the buffer requirements associated with each operation, i.e.

$$Buff = \sum_{i=0}^{N-1} b_i \qquad (5.30)$$

The formulation of Buff is identical to the one proposed by Govindarajan et al [44], except for the presence of the $+1$ term (which is particular to our virtual register model) in Inequality (5.29). Note that, as formulated here, the formulation of Buff is not structured because of Inequality (5.29).
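Under the same hypothetical encoding as the previous sketch, the buffer requirement of Inequality (5.28) is simply each lifetime rounded up to a whole number of stages:

import math

def buffers(time, reg_edges, II):
    """b[i] = ceil(lifetime / II), the smallest b[i] satisfying (5.28)."""
    b = {}
    for i, j, omega in reg_edges:
        life = time[j] + omega * II - time[i] + 1
        b[i] = max(b.get(i, 0), math.ceil(life / II))
    return b

b = buffers({0: 0, 1: 2, 2: 5}, [(0, 1, 0), (0, 2, 1), (1, 2, 0)], 3)
print(b, "Buff =", sum(b.values()))              # {0: 3, 1: 2} Buff = 5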

5.3.7 Register Requirements

We now define the register requirements associated with a modulo schedule. Since we are minimizing MaxLive, we must compute the precise number of live virtual registers in each of $II$ consecutive cycles. We define a new $II \times N$ integer matrix, called $V$, where $v_{r,i}$ corresponds to the number of live virtual registers generated by operation $i$ in row $r$. Matrix $V$ is defined as follows:

$$\sum_{z=0}^{r} a_{z,i} - \sum_{z=0}^{r-1} a_{z,j} + (k_j + \omega_{i,j}) - k_i \le v_{r,i} \qquad \forall (i,j) \in E_{reg},\; \forall r \in [0, II) \qquad (5.31)$$

Inequality (5.31) quantifies the register requirements in row $r$ generated by the register dependence $(i,j)$, from the def operation $i$ to the use operation $j$. To demonstrate the validity of Inequality (5.31), we compute the fractional and integral parts of the lifetime associated with a register edge $(i,j)$, namely:

$frac_{r,i,j}$: fractional part of the lifetime in row $r$ associated with a register edge $(i,j)$.
$int_{i,j}$: integral part of the lifetime associated with a register edge $(i,j)$.

We will demonstrate that their sum is exactly equal to $v_{r,i}$ if operation $j$ is the last use of the operation $i$ result, and never exceeds $v_{r,i}$ if $j$ is not the last-use operation. We introduce three definitions specific to def operation $i$ and use operation $j$:

$$d_r = \sum_{z=0}^{r} a_{z,i}, \qquad u_r = \sum_{z=0}^{r-1} a_{z,j}, \qquad x = \begin{cases} 1 & \text{if } row_i > row_j \\ 0 & \text{otherwise} \end{cases} \qquad (5.32)$$

Recall that among all rows $z$, for a given operation $i$, only one $a_{z,i}$ is nonzero, and that $a_{z,i}$ is 1. As a result, both $d_r$ and $u_r$ are nondecreasing functions of $r$, with minimum value 0 and maximum value 1. As defined in Equation (5.32), $d_r$ becomes 1 exactly in the row $z$ where def operation $i$ is scheduled ($a_{z,i} = 1$). Similarly, $u_r$ becomes 1 exactly in the row $z$ following the row where use operation $j$ is scheduled ($a_{z-1,j} = 1$). Once $d_r$ or $u_r$ becomes 1, it remains 1 through row $II - 1$.

When computing the fractional part of the lifetime associated with def operation $i$ and use operation $j$, two cases must be considered. In the first case, $row_i \le row_j$. Therefore, the term $d_r - u_r$ is initially 0, becomes 1 in $row_i$, and returns to 0 in the row after $row_j$. As a result, the term $d_r - u_r$ is 1 exactly where the fractional part of the lifetime associated with $(i,j)$ would contribute to the register requirements if operation $j$ is the last use; elsewhere, $d_r - u_r$ is 0. In Figure 5.3a for example, the def operation $i$ is scheduled in row 1 ($a_{1,i} = 1$) and the use operation $j$ is scheduled in row 3 ($a_{3,j} = 1$). The plots of $d_r$, $u_r$, and $d_r - u_r$ are also shown in Figure 5.3a.

[Figure 5.3: two MRT examples with II = 5. (a) Value live from row 1 to 3, with plots of $d_r$, $u_r$, and $d_r - u_r$. (b) Value live from row 4 to 1 (with wraparound), with plots of $d_r$, $u_r$, and $d_r - u_r + 1$.]

Figure 5.3: Determining the fractional and integral part of a lifetime.

Notice that the fractional part of this lifetime is correctly determined to be live in rows 1, 2, and 3.

In the second case, $row_i > row_j$. The term $d_r - u_r + 1$ is initially 1, becomes 0 in the row following $row_j$, and returns to 1 in $row_i$. As a result, the term $d_r - u_r + 1$ is 1 exactly where the fractional part of the lifetime associated with $(i,j)$ would contribute to the register requirements if operation $j$ is the last use; elsewhere, $d_r - u_r + 1$ is 0. In Figure 5.3b for example, the def operation $i$ is scheduled in row 4 ($a_{4,i} = 1$) and the use operation $j$ is scheduled in row 1 ($a_{1,j} = 1$). The plots of $d_r$, $u_r$, and $d_r - u_r + 1$ are shown in Figure 5.3b. We see that the fractional part of this lifetime is correctly determined to be live in rows 4, 0, and 1.

Using the definitions of Equation (5.32), we can express the fractional part of the lifetime associated with operations $i$ and $j$ in row $r$ as:

$$frac_{r,i,j} = d_r - u_r + x \qquad (5.33)$$

where $frac_{r,i,j}$ is 1 if and only if row $r$ contributes to the fractional part of the lifetime associated with operations $i$ and $j$.

To compute the integral part of the lifetime associated with operations $i$ and $j$, we must also consider two cases. In the first case, $row_i \le row_j$, and the integral part corresponds directly to the difference between the stage numbers of the use and the def operations, namely $(k_j + \omega_{i,j}) - k_i$. In Figure 5.3a, the integral part is 0 when $k_j = k_i$ and $\omega_{i,j} = 0$. If the use operation were delayed by $II$ cycles, $k_j$ would increase by 1 and so would the integral part. In the second case, $row_i > row_j$, and the integral part corresponds directly to $(k_j + \omega_{i,j}) - k_i - 1$, where the $-1$ term is introduced to counterbalance the wraparound effect. In Figure 5.3b, the integral part is 0 when $k_j = k_i + 1$ and $\omega_{i,j} = 0$. Again, delaying the use operation by $II$ cycles would increase the integral part by 1. As a result, we can express the integral part of the lifetime associated with operations $i$ and $j$ as:

$$int_{i,j} = k_j + \omega_{i,j} - k_i - x \qquad (5.34)$$

By summing the fractional and integral parts of the lifetime, as defined in Equations (5.33) and (5.34), the $x$ terms cancel each other, resulting in the left hand side of Inequality (5.31). Finally, the register requirement contribution of an operation $i$ in row $r$ corresponds to the maximum register requirements among all uses of operation $i$. Using the previous result, MaxLive is defined as the sum of the fractional and integral parts of the lifetime associated with each virtual register in the row with the largest fractional plus integral part:

$$\sum_{i=0}^{N-1} v_{r,i} \le MaxLive \qquad \forall r \in [0, II) \qquad (5.35)$$

Since the objective is to minimize MaxLive, it is sufficient to use Inequality (5.31) rather than setting $v_{r,i}$ equal to the maximum left-hand side over all operations $j$ for the operation $i$ under consideration. Note also that the formulations of both Inequalities (5.31) and (5.35) are structured.

For a machine with multiple register files, we would introduce one MaxLive variable per register file and we would transform Inequality (5.35) by summing the $v_{r,i}$ variables only over the operations that write their results in the register file being considered. The same modification also applies to Equations (5.27) and (5.30) to minimize, respectively, the average lifetime or buffer requirements for individual register files.
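The row-by-row accounting can again be illustrated on a finished schedule. The following hedged sketch (same hypothetical encoding as the earlier examples) evaluates the left hand side of Inequality (5.31) for every register edge and row, and derives MaxLive per Inequality (5.35); on the three-operation example used above it reports a MaxLive of 5:

def max_live(time, reg_edges, II, n_ops):
    """Evaluate v[r][i] from (5.31) and return MaxLive from (5.35)."""
    v = [[0] * n_ops for _ in range(II)]
    for i, j, omega in reg_edges:
        k_i, row_i = divmod(time[i], II)
        k_j, row_j = divmod(time[j], II)
        for r in range(II):
            d_r = 1 if row_i <= r else 0       # sum of a[z][i], z in [0, r]
            u_r = 1 if row_j <= r - 1 else 0   # sum of a[z][j], z in [0, r-1]
            lhs = d_r - u_r + (k_j + omega) - k_i
            v[r][i] = max(v[r][i], lhs)        # max over all uses j of i
    return max(sum(v[r]) for r in range(II))

print(max_live({0: 0, 1: 2, 2: 5}, [(0, 1, 0), (0, 2, 1), (1, 2, 0)], 3, 3))  # 5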

5.4 Optimum Modulo Schedules

We present in this section an algorithm that finds a schedule with the highest steady-state throughput over all modulo schedules, and the minimum register requirements among such schedules. The formulation is based on the scheduling space defined by Inequalities (5.1), (5.2), and (5.24) and on the objective function for minimum MaxLive defined by Inequalities (5.31) and (5.35). This model and its variables are summarized, respectively, in Figure 5.4 and Table 5.1. Note that the formulation finds a schedule with minimum MaxLive for a given initiation interval. Therefore, a schedule with minimum register requirements for the smallest feasible initiation interval is obtained by solving a series of problems until the smallest initiation interval with a feasible solution is found.

Algorithm 5.1 (MinReg Modulo Scheduler) Consider a machine description $M_q$, $Res_{i,q}$, and $l_{i,j}$ and consider a data dependence graph $G = \{V, E_{sched}, E_{reg}\}$ and $\omega_{i,j}$. The following algorithm finds a schedule with the highest steady-state throughput over all modulo schedules, and the minimum register requirements among such schedules:

1. Compute the Minimum Initiation Interval ($MII$) [78] and set the tentative $II$ to $MII$.
2. Build the integer linear programming system as indicated in Figure 5.4 and search for a solution.
3. If the system built in Step 2 fails to find a feasible solution, increase $II$ by one and return to Step 2. Otherwise, the solution found is optimal.
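The driver loop of Algorithm 5.1 is straightforward; in the sketch below, compute_mii and solve_ilp are hypothetical stand-ins for the MII computation [78] and the integer linear program of Figure 5.4, not interfaces of the actual implementation:

def min_reg_modulo_schedule(graph, machine, compute_mii, solve_ilp, max_ii=512):
    ii = compute_mii(graph, machine)              # step 1: tentative II = MII
    while ii <= max_ii:
        schedule = solve_ilp(graph, machine, ii)  # step 2: min-MaxLive ILP
        if schedule is not None:
            return ii, schedule                   # smallest feasible II: optimal
        ii += 1                                   # step 3: retry with II + 1
    raise RuntimeError("no feasible II found up to max_ii")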

For comparison purposes, we also investigated a model minimizing AvgLive, as formulated by Inequalities (5.26) and (5.27) in place of the Lifetimes and MaxLive inequalities in Figure 5.4. Similarly, a model minimizing Buff is formulated by Inequalities (5.29) and (5.30) in place of the Lifetimes and MaxLive inequalities in Figure 5.4, and a model with no objective function is formulated by deleting the objective function and the Lifetimes and MaxLive inequalities in Figure 5.4. Each of these formulations uses precisely the same scheduling space as used when minimizing MaxLive.

Minimize: MaxLive

Subject to:

Assignment:   $\sum_{r=0}^{II-1} a_{r,i} = 1 \qquad \forall i \in [0, N)$

Resources:    $\sum_{i=0}^{N-1} \sum_{c \in Res_{i,q}} a_{(r-c) \bmod II,\, i} \le M_q \qquad \forall q \in Q,\; \forall r \in [0, II)$

Dependences:  $\sum_{x=r}^{II-1} a_{x,i} + \sum_{x=0}^{(r+l_{i,j}-1) \bmod II} a_{x,j} + k_i - k_j \le \omega_{i,j} - \left\lfloor \frac{r + l_{i,j} - 1}{II} \right\rfloor + 1 \qquad \forall r \in [0, II),\; \forall (i,j) \in E_{sched}$

Lifetimes:    $\sum_{z=0}^{r} a_{z,i} - \sum_{z=0}^{r-1} a_{z,j} + k_j - k_i + \omega_{i,j} \le v_{r,i} \qquad \forall r \in [0, II),\; \forall (i,j) \in E_{reg}$

MaxLive:      $\sum_{i=0}^{N-1} v_{r,i} \le MaxLive \qquad \forall r \in [0, II)$

Definitions:  $MaxLive,\; a_{r,i},\; v_{r,i},\; k_i$: nonnegative integers

Figure 5.4: Modulo scheduling for minimum MaxLive.


Objective function   Variables      Constraints
none                 N(II + 1)      N + II·|Q| + II·|E_sched|
AvgLive              N(II + 2)      N + II·|Q| + II·|E_sched| + |E_reg|
Buff                 N(II + 2)      N + II·|Q| + II·|E_sched| + |E_reg|
MaxLive              N(2II + 1)     N + II·|Q| + II·|E_sched| + II·(|E_reg| + 1)

Table 5.2: Numbers of variables and constraints for 4 distinct formulations.

The complexity of these models, in terms of the number of variables and constraints, is shown in Table 5.2. Note that the numbers of binary and integer variables needed for the formulation of the modulo scheduling space are $N \cdot II$ and $N$, respectively, and that $N + II \cdot |Q| + II \cdot |E_{sched}|$ constraints are used to define the space of valid modulo schedules. When either AvgLive or Buff is minimized, $N$ additional integer variables are introduced as well as $|E_{reg}|$ additional constraints. If MaxLive is minimized, $N \cdot II$ additional integer variables are introduced, along with $II \cdot (|E_{reg}| + 1)$ new constraints. Note also that if, instead of the structured scheduling constraints, the traditional scheduling constraints are used, as defined by Inequality (5.6), the number of constraints decreases by $(II - 1) \cdot |E_{sched}|$ for all 4 models in Table 5.2. Experimental evidence gathered from the 1327-loop benchmark in Section 5.6.1 shows that an efficient formulation is not only dependent on the numbers of variables and constraints in the formulation, but also on whether the constraints are structured or not.


5.5 Bounding the Schedule Space

As presented in Figure 5.4, the integer linear programming search space is unbounded because there is no limit on the $k_i$ values and hence on the maximal schedule length. To bound the schedule length of an iteration and thereby significantly improve the efficiency of the integer programming solver, we introduce two pseudo operations in the dependence graph: start is a predecessor of every operation in the graph, and stop is a successor of every operation in the graph. The latency and dependence distance of the outgoing edges from start and of the incoming edges to stop are 0. Using the methodology presented by Huff, the MinDist relation can be used to determine a lower bound and an upper bound on the schedule time of each operation [49]. We know that operation $i$ cannot be scheduled earlier than $MinDist(start, i)$ cycles after the start pseudo operation. Similarly, we know that operation $i$ cannot be scheduled later than $MinDist(i, stop)$ cycles before the stop pseudo operation. To bound the search space of the integer linear programming solver, we must relate the schedule times of these two pseudo operations to one another. Without loss of generality, we define the maximal schedule length of a schedule to be $MinDist(start, stop) + S$ where $S$ is a nonnegative integer value. Using this result, we bound the schedule time of operation $i$ as follows:

$$MinDist(start, i) \le time_i \qquad (5.36)$$
$$time_i \le MinDist(start, stop) + S - MinDist(i, stop) \qquad (5.37)$$

where $S$ represents the schedule slack or degree of freedom given to the scheduler.
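These bounds can be computed with any longest-path algorithm over the start/stop-augmented dependence graph, where each edge $(i,j)$ carries weight $l_{i,j} - \omega_{i,j} \cdot II$. A hedged sketch follows (Bellman-Ford-style relaxation with a hypothetical edge encoding; it assumes a feasible $II$, so that no dependence cycle has positive weight and $n-1$ relaxation rounds converge):

def min_dist(n, edges, II, src):
    """Longest-path distance from src; edge weight = latency - omega * II."""
    NEG = float("-inf")
    dist = [NEG] * n
    dist[src] = 0
    for _ in range(n - 1):           # converges: no positive-weight cycles
        for i, j, latency, omega in edges:
            w = latency - omega * II
            if dist[i] != NEG and dist[i] + w > dist[j]:
                dist[j] = dist[i] + w
    return dist

def time_bounds(n, edges, II, start, stop, slack):
    d_start = min_dist(n, edges, II, start)
    rev = [(j, i, lat, om) for (i, j, lat, om) in edges]  # MinDist(i, stop)
    d_stop = min_dist(n, rev, II, stop)
    length = d_start[stop] + slack   # MinDist(start, stop) + S
    return [(d_start[i], length - d_stop[i]) for i in range(n)]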

Theorem 5.1 Let $R_S$ be the minimum MaxLive for a dependence graph $G$ and initiation interval $II$ among all schedules of schedule length no greater than $MinDist(start, stop) + S$. The minimum MaxLive values are bounded as follows:

$$R_0 \ge R_1 \ge \ldots \ge R_{y-1} \ge R_y = R_{y+1} = \ldots = R_\infty, \quad \text{where } y = (N-1)(l_{max} + II - 1) - MinDist(start, stop)$$

Note also that $l_{max}$ corresponds to the longest latency among all operations present in the loop iteration.

Proof. The relation $R_S \ge R_{S+1}$ (for any positive schedule slack $S$) clearly holds since any schedule with minimum MaxLive and schedule length no greater than $MinDist(start, stop) + S$ is necessarily included in the set of schedules of length no greater than $MinDist(start, stop) + S + 1$.

To prove $R_y = R_\infty$, we must show that $R_y$ equals the minimum MaxLive and hence that no $R_x$ can be less than $R_y$ if $x > y$. Suppose that $R_y \ne$ minimum MaxLive, i.e. $R_y >$ minimum MaxLive. Then the only minimum-MaxLive schedules must have a schedule length greater than $MinDist(start, stop) + y$. We demonstrate that this is impossible by showing that for any schedule of length greater than $MinDist(start, stop) + y$, there must be another valid schedule of length no greater than $MinDist(start, stop) + y$ with no larger register requirements.

Consider a schedule of length $z > MinDist(start, stop) + y$, i.e. $z > (N-1)(l_{max} + II - 1)$. We now show that we can trivially transform this schedule to obtain a valid schedule of length no greater than $MinDist(start, stop) + y$ with no larger register requirements. This demonstration is based on the pigeonhole principle. In the schedule of length $z$, consider each interval $[t_i, t_j]$ where no operation is scheduled in the open interval $(t_i, t_j)$. Since there are $N$ operations, there are at most $N - 1$ such intervals. Furthermore, since the length of the schedule, $z$, is strictly larger than $(N-1)(l_{max} + II - 1)$, there must be at least one interval of length at least $l_{max} + II$. (We assume here $N > 1$, i.e. we exclude the trivial problem of scheduling a loop with a single operation.) Consider $[t'_i, t'_j]$ to be such an interval. We may then trivially reschedule all operations scheduled after $t'_i$, $II$ cycles earlier, without violating any dependence constraints. Moreover, rescheduling operations $II$ cycles earlier is guaranteed to satisfy the resource constraints, as the MRT is unchanged by the new schedule. Consequently, this transformation results in a valid modulo schedule with a schedule length that is reduced by $II$ cycles. By finite induction, we can reduce the schedule length $z$ to $z'$, where $z' \le (N-1)(l_{max} + II - 1)$.


Furthermore, reducing the schedule length by multiples of $II$ is guaranteed to result in no larger register requirements because it reduces all the lifetimes, if any, that span the selected $[t'_i, t'_j]$ intervals by $II$ cycles for each $[t'_i, t'_j]$ interval spanned. Thus this new schedule of length $z'$ is precisely the schedule we set out to find. Therefore, limiting the schedule length to $(N-1)(l_{max} + II - 1)$, i.e. limiting the scheduling slack to $y = (N-1)(l_{max} + II - 1) - MinDist(start, stop)$, does not preclude finding a minimum MaxLive solution, and $R_y = R_{y+1} = \ldots = R_\infty$. $\Box$
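The compaction step used in this proof is mechanical. The following sketch (hypothetical encoding) finds one gap of at least $l_{max} + II$ empty cycles and shifts every later operation $II$ cycles earlier, which leaves all times modulo $II$, and hence the MRT, unchanged:

def compact_once(time, II, l_max):
    """Shift every operation after a gap of >= l_max + II cycles II earlier."""
    cycles = sorted(set(time.values()))
    for a, b in zip(cycles, cycles[1:]):
        if b - a >= l_max + II:      # pigeonhole: such a gap exists if too long
            return {op: (t - II if t > a else t) for op, t in time.items()}
    return time                      # no such gap: schedule already compact

print(compact_once({0: 0, 1: 20}, II=3, l_max=2))   # {0: 0, 1: 17}; 20%3 == 17%3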

5.6 Measurements

In this section, we compare the performance of the optimal modulo scheduling algorithms developed in this chapter, the stage-scheduling algorithms presented in Chapter 4, and a register-insensitive modulo scheduler. In addition to the scheduling algorithms presented in Section 4.6, we use here the following modulo scheduling algorithms.

MinReg Modulo-Scheduler. This scheduler finds a schedule with the maximum steady-state throughput over all modulo schedules, and with the minimum register requirements among such schedules. It uses the integer programming model developed in Section 5.3 with the minimum MaxLive objective function, as summarized in Figure 5.4.

MinBuff Modulo-Scheduler. This scheduler finds a schedule with the maximum steady-state throughput, and the minimum buffer requirements among all such schedules. It uses the same formulation of the scheduling space as the MinReg Modulo-Scheduler and differs only by its objective function, which minimizes Buff, as computed using Inequalities (5.29) and (5.30) in place of the Lifetimes and MaxLive inequalities of Figure 5.4. Although it minimizes buffers, in our comparisons we always present the actual register requirements associated with these schedules.


MinLife Modulo-Scheduler. This scheduler finds a schedule with the maximum steady-state throughput, and the minimum average lifetime among all such schedules. It differs from the MinReg Modulo-Scheduler only by its objective function, which minimizes AvgLive, as computed by using Inequalities (5.26) and (5.27) in place of the Lifetimes and MaxLive inequalities of Figure 5.4. In our comparisons, we also present the actual register requirements associated with these schedules.

NoObj Modulo-Scheduler. This scheduler simply finds a schedule with the maximum steady-state throughput, without minimizing any objective function. It uses the same formulation of the modulo scheduling space as the MinReg Modulo-Scheduler, i.e. the same constraints as in Figure 5.4 except that the Lifetimes and MaxLive constraints are deleted. It simply returns the first schedule that it finds.

In our experimental setup, we use the CPLEX solver running on an HP-9000/715 workstation. The solver searches for an optimal solution using no more than 15 minutes of execution time. If no solution is found within the time limit, the solver aborts and "no schedule" is reported. The benchmark used here is the 1327-loop benchmark as compiled for the Cydra 5 and presented in Section 2.7.

In addition to the constraints presented in Section 5.3, we bound the scheduling space using Inequalities (5.36) and (5.37) with a schedule slack of 20 cycles. We use here a constant schedule slack in order to find schedules with small schedule length, since the length of a schedule impacts the transient performance of a loop, as indicated by Equation (3.1). We also use a lower bound on the register requirements, i.e. MinAvg as computed by Huff [49], to bound the values of the objective functions.

The resource constraints are formulated for the reduced machine representation of the Cydra 5 shown in Figure 2.5b, as produced by the algorithm described in Chapter 2 for the discrete representation. For operations with alternatives, we use here the alternative that was selected by the Iterative Modulo Scheduler [77][78]. Furthermore, for a given loop, we only model the resources that are used by at least two operations, since the other resources, if any, pose no resource conflicts.

In this section, we evaluate the performance of the integer linear programming formulations in Section 5.6.1, present the increase in throughput achieved by the modulo schedulers in Section 5.6.2, and present the decrease in register requirements obtained by these schedulers in Section 5.6.3.

5.6.1 Evaluation of the Integer Linear Programming Formulation

In this section, we evaluate the performance of the integer linear programming formulation presented in Section 5.3, using several metrics such as the number and size of problems that are successfully solved, the number of branch-and-bound nodes traversed by the solver, and the number of simplex iterations performed by the solver.

In our first experiment, we investigate the benefit of using the structured formulation of the scheduling dependences (derived in Section 5.3.4) instead of the traditional unstructured formulation (presented in Section 5.3.3). When the structured scheduling constraints are used, both the NoObj and the MinReg Modulo-Schedulers use formulations that are fully structured, unlike the formulations of the MinLife and the MinBuff Modulo-Schedulers, since AvgLive and Buff, respectively, are computed using some unstructured constraints.

We search for optimal schedules within the 15 minute execution time limit using the MinReg, MinBuff, MinLife, and NoObj Modulo-Schedulers, once using the structured scheduling constraints and once using the traditional scheduling constraints. We evaluate the performance of the modulo schedulers for the 711 loops that were successfully scheduled by all algorithms.

Our first graph presents the average number of branch-and-bound nodes visited by the solver when using either the structured or the traditional scheduling constraints. The CPLEX solver used in our experiment explores branch-and-bound nodes when it must force variables to integral values. Note that if the initial solution found by the solver is integral, no branch-and-bound node is ever visited. This graph is shown in Figure 5.5 with a logarithmic Y-axis.

[Figure 5.5: bar chart with a logarithmic Y-axis (numbers of branch-and-bound nodes, 0.1 to 1000) for the NoObj, MinLife, MinBuff, and MinReg schedulers, comparing the Traditional and Structured formulations.]

Figure 5.5: Average number of branch-and-bound nodes per loop visited by the CPLEX solver in the 711-loop benchmark.

[Figure 5.6: bar chart with a logarithmic Y-axis (number of simplex iterations, 1 to 10000) for the NoObj, MinLife, MinBuff, and MinReg schedulers, comparing the Traditional and Structured formulations.]

Figure 5.6: Average number of simplex iterations per loop performed by the CPLEX solver in the 711-loop benchmark.


The first observation is that all four schedulers benefit significantly from using the structured constraints which, for example, decrease the average number of nodes visited by the MinReg Modulo-Scheduler by a factor of 128. The second observation is that the two algorithms that result in the lowest average number of branch-and-bound nodes are the MinReg and the NoObj Modulo-Schedulers, when both are based on formulations that are structured, i.e. consisting exclusively of structured constraints.

Figure 5.6 plots the average number of simplex iterations when scheduling with the same algorithms and dataset. Simplex iterations are used by the CPLEX program to solve the linear programming model associated with each branch-and-bound node. This graph is shown in Figure 5.6, also with a logarithmic Y-axis. Similar observations apply here: using the structured scheduling constraints significantly decreases the average number of simplex iterations, for example, by a factor of 34 for the MinReg Modulo-Scheduler. Note also that the smallest average number is obtained by the NoObj Modulo-Scheduler, with an average of 6.3 iterations per loop with the structured formulation and 9.7 with the unstructured formulation.

Interestingly, the third lowest average number is achieved by the MinLife Modulo-Scheduler, with an average of 24.6 iterations per loop with the structured formulation, even though the structured MinLife Modulo-Scheduler visits on average more branch-and-bound nodes than the structured MinReg Modulo-Scheduler. This effect may be due to the fact that the MinLife Modulo-Scheduler has significantly fewer variables and constraints than the MinReg Modulo-Scheduler, and thus can apparently process each node with fewer simplex iterations.

Experimental evidence thus indicates that searching for optimal schedules requires traversing fewer branch-and-bound nodes and performing fewer simplex iterations, on average, when using the structured rather than the traditional formulation of the dependence constraints. As a result, schedulers based on the structured scheduling constraints run significantly faster; for example, the structured MinReg Modulo-Scheduler completes in 11.6% of the time (i.e. 101.0 instead of 870.2 seconds) when scheduling all the loops successfully scheduled by the MinReg Modulo-Scheduler with unstructured scheduling constraints. Consequently, the solver can schedule a larger

fraction of the loops in the benchmark suite when using the structured scheduling constraints; e.g. 1181 versus 1089 loops for the NoObj Modulo-Scheduler, 937 versus 861 loops for the MinLife Modulo-Scheduler, 801 versus 765 loops for the MinBuff Modulo-Scheduler, and 920 versus 781 loops for the MinReg Modulo-Scheduler. We thus exclusively employ the structured scheduling constraints in the remainder of this chapter.

In our second experiment, we investigate the performance of each modulo scheduling algorithm with the structured scheduling constraints in a benchmark containing all successfully scheduled loops. We present the performance characteristics of the four modulo scheduling algorithms in Table 5.3. In addition to the previous performance numbers (i.e. the number of branch-and-bound nodes and simplex iterations), the table also lists data on the numbers of variables and constraints, prior to any simplifications that might be performed by the CPLEX solver, as well as the number of operations, N, and the initiation interval, II.

First, observe the distribution of the data in Table 5.3. As indicated by the low median values, relative to the average numbers, there is a large number of simple loops in the benchmark suite. However, the solvers clearly succeed in solving some rather large problems, as indicated by the large max values.

The second observation is that the NoObj Modulo-Scheduler processes loops with significantly larger average numbers of operations, II, variables, and constraints than the three other algorithms. It also handles the loops with the largest maximum number of operations and II, i.e. 80 operations and 118 cycles, respectively. Note also that since it simply returns the first valid integral solution, the number of visited branch-and-bound nodes directly corresponds to the nodes that are needed by the solver to force all variables to integer values. As indicated in the table, very few branch-and-bound nodes are required (0 nodes in 73.9% of the loops, 7.93 nodes on average, and never more than 337 nodes). This data confirms again the benefit of structured formulations.

The third observation is that the MinReg Modulo-Scheduler processes loops that are nearly as large as those the MinLife Modulo-Scheduler can process, and larger

Measurements:                          min    freq   median   average        max
NoObj Modulo-Sched: (1181 loops)
  Variables                           4.00    0.3%    33.00    182.50    3880.00
  Constraints                         4.00    0.3%    45.00    270.86    5320.00
  Branch-and-bound nodes              0.00   73.9%     0.00      7.93     337.00
  Iterations                          0.00   37.0%    11.00    331.98   20645.00
  II                                  1.00   31.9%     2.00      7.51     118.00
  N                                   2.00    0.4%     9.00     13.91      80.00
MinLife Modulo-Sched: (937 loops)
  Variables                           6.00    0.4%    28.00     78.22    1232.00
  Constraints                         5.00    0.4%    34.00    109.34    1783.00
  Branch-and-bound nodes              0.00   62.8%     0.00    320.08   28712.00
  Iterations                          0.00   24.8%    14.00   4380.51  353654.00
  II                                  1.00   39.7%     2.00      5.09      59.00
  N                                   2.00    0.5%     7.00      8.68      41.00
MinBuff Modulo-Sched: (801 loops)
  Variables                           6.00    0.5%    20.00     55.80    2520.00
  Constraints                         5.00    0.5%    19.00     76.62    3464.00
  Branch-and-bound nodes              0.00   58.8%     0.00    441.18   25969.00
  Iterations                          0.00   29.3%     3.00   2013.64  159121.00
  II                                  1.00   47.2%     2.00      4.46     118.00
  N                                   2.00    0.6%     6.00      6.76      21.00
MinReg Modulo-Sched: (920 loops)
  Variables                           6.00    0.4%    35.00    120.43    5809.00
  Constraints                         6.00    0.4%    43.00    155.23    6901.00
  Branch-and-bound nodes              0.00   63.7%     0.00    124.42   10711.00
  Iterations                          0.00   24.3%    23.00   2533.77  167504.00
  II                                  1.00   41.1%     2.00      4.72     118.00
  N                                   2.00    0.5%     7.00      8.37      41.00

Table 5.3: Characteristics of the modulo scheduling algorithms (using the structured scheduling constraints).


than those of the MinBuff Modulo-Scheduler, even though it handles problems with 54.0% and 115.8% more variables, on average, than the MinLife and the MinBuff Modulo-Schedulers, respectively. Thus, the structured formulation of the MinReg Modulo-Scheduler appears to successfully counterbalance the impact of its increased number of variables due to the precise modeling of the register requirements. Incidentally, the loop with the largest number of variables is solved by the MinReg Modulo-Scheduler; this loop has 37 operations, 36 register edges, 45 scheduling edges, 6 conflicting resources, an II of 78, and resulted in a problem with 5809 variables and 6901 constraints.

To summarize the findings of this section, the NoObj Modulo-Scheduler clearly schedules the largest fraction of the loops in the benchmark suite. This result is not unexpected since this formulation does not minimize any objective function. The second and surprising result is that using the MinReg scheduler is nearly as efficient as using the MinLife scheduler, and is much more efficient than using the MinBuff scheduler, in spite of the fact that precisely minimizing MaxLive requires significantly more variables than minimizing some approximations of the register requirements.

5.6.2 Maximum Throughput Measurements

In our third experiment, we use the NoObj Modulo-Scheduler to search for schedules with maximum throughput. Since the Iterative Modulo Scheduler proposed by Rau [78] results in schedules with optimal throughput for 1274 (or 96.0%) of the 1327 loops in the benchmark suite, we only need to investigate here the remaining 53 loops. Among these 53 loops, the NoObj Modulo-Scheduler finds 2 loops where II can be decreased by 2 (but not 3) cycles, i.e. the scheduler finds a schedule with an II decreased by 2 cycles and shows that decreasing II by 3 cycles is infeasible. The scheduler also finds 6 loops where II can be decreased by 1 (but not 2) cycles. It furthermore shows that the II of 22 of these loops cannot be decreased even by 1 cycle. Finally, the scheduler does not complete the remaining 23 loops within the 15 minute execution time limit. Thus, the NoObj Modulo-Scheduler succeeds in finding

schedules with better throughput for 8 (or 15.1%) of the 53 interesting loops.

Using the NoObj Modulo-Scheduler, we have thus shown that the Iterative Modulo Scheduler actually finds a schedule with maximum throughput for 22 more loops than previously known, i.e. it results in schedules with maximum throughput for 1296 (or 97.7%) of the 1327 loops in the benchmark suite. In addition to these 22 loops, the NoObj Modulo-Scheduler finds 8 more loops with maximum throughput (with a higher throughput than the schedule found by the Iterative Modulo Scheduler); thus we now have schedules with maximum throughput for 1304 (or 98.3%) of the loops in the benchmark suite. At this point, it is unknown whether the remaining 23 loops have suboptimal throughput.

5.6.3 Register Requirements Measurements

Our fourth experiment investigates the register requirements of the modulo scheduling algorithms investigated in this chapter as well as some of the stage scheduling algorithms presented in Chapter 4. We use for this measurement a subset of 920 loops from the 1327-loop benchmark, corresponding to all the loops successfully scheduled by the MinReg Modulo-Scheduler using no more than 15 minutes of execution time on an HP-9000/715 workstation. The principal characteristics of the 920-loop benchmark are shown in Table 5.4.

Measurements:                        min    freq   median   average      max
Operations                          2.00    0.5%     7.00      8.37    41.00
II                                  1.00   41.1%     2.00      4.72   118.00
II/MII                              1.00   98.9%     1.00      1.00     1.50
nonredundant register edges         1.00    0.5%     7.00      8.80    50.00
nonredundant scheduling edges       1.00    0.5%     7.00      9.20    57.00
modeled machine resources           1.00   54.8%     1.00      2.51     8.00

Table 5.4: Characteristics of the 920-loop benchmark suite.

The Iterative Modulo Scheduler was used to generate the required MRT input for all stage scheduling algorithms. All scheduling algorithms used the initiation interval found by the Iterative Modulo Scheduler, to compare the register requirements of the various scheduling algorithms at identical levels of instruction level parallelism.

[Figure 5.7: fraction of loops scheduled (Y-axis, 0 to 1) versus MaxLive (X-axis, 0 to 96), with curves for the Schedule-Independent Lower Bound, the MinReg Modulo-Scheduler, and the Iterative Modulo Scheduler.]

Figure 5.7: Register requirements in the 920-loop benchmark.

As indicated in Table 5.4, only 10 loops in the 920-loop benchmark failed to achieve MII. Of these 10 loops, we know that II can be decreased by 2 cycles for 1 loop, by 1 cycle for 4 loops, and cannot be decreased for 2 loops; the situation is unknown for the 3 remaining loops because no solution was found by the solver within the time limit.

We first investigate the register requirements of the MinReg Modulo-Scheduler solutions, which have the lowest register requirements over all valid modulo schedules with the achieved II. Figure 5.7 presents the fraction of the 920 loops that can be scheduled for a machine with any given number of registers without spilling and without increasing II. In this graph, the X-axis represents MaxLive and the Y-axis represents the fraction of loops scheduled. This graph presents three curves. The first curve, labeled "Schedule-Independent Lower Bound," corresponds to [49] and is representative of the register requirements of a machine without any resource conflicts. The second curve presents the register requirements of the schedules generated by the MinReg Modulo-Scheduler.

[Figure 5.8: fraction of loops scheduled (Y-axis, 0 to 1) versus additional MaxLive (X-axis, 0 to 48), with curves for the MinReg Stage-Scheduler, Heuristic AC+3*UP+RSS, Heuristic 3*UP+RSS (Bounded), and the Iterative Modulo Scheduler.]

Figure 5.8: Additional register requirements relative to the MinReg Modulo-Scheduler.

The fact that these first two curves are nearly identical indicates that the complex resource requirements of the Cydra 5 machine do not significantly impact the register requirements of the optimal schedules in the benchmark suite considered. The third curve presents the register requirements of the Iterative Modulo Scheduler, which presently does not attempt to minimize the register requirements. The gap between the last two curves is indicative of the degree of improvement that is possible with register-sensitive modulo scheduling algorithms.

The next graph illustrates the number of additional registers needed relative to the MinReg Modulo-Scheduler when heuristic and MinReg stage schedulers are used on the MRT found by the Iterative Modulo Scheduler. The X-axis of Figure 5.8 represents the number of additional registers and the Y-axis represents the fraction of loops that can be scheduled with this number of additional registers. The curves represent the additional register requirements for the MinReg Stage-Scheduler, the best stage scheduling heuristic with unlimited schedule length (AC+3*UP+RSS), the

best stage scheduling heuristic with bounded schedule length (3*UP+RSS), and the Iterative Modulo Scheduler itself.

We observe that the MinReg Stage-Scheduler finds a schedule with no additional register requirements in 65.9% of the loops in the 920-loop benchmark. It requires no more than 3 additional registers to schedule 95.0% of the loops, and may require up to 10 additional registers for the remaining loops. We may also notice that the curves for both stage scheduling heuristics, as in Chapter 4, nearly overlap with the MinReg Stage-Scheduler curve. In particular, the AC+3*UP+RSS and the bounded 3*UP+RSS heuristics find a schedule with no additional register requirements in, respectively, 65.9% and 60.4% of the loops, require no more than 3 and 4 additional registers to schedule 95% of the loops, and never need more than 10 additional registers. In particular, the 3*UP+RSS stage scheduling heuristic with bounded schedule length combined with Rau's Iterative Modulo Scheduler [78] schedules 92.8% of the 920 loops with no more than 3 additional registers, and 98.3% with no more than 6 additional registers. Furthermore, these schedules with near-optimal register requirements are obtained without impact on the performance, since the Iterative Modulo Scheduler solely focuses on finding a schedule with low initiation interval and schedule length, and the bounded stage scheduling heuristic does not increase either of these while reducing the register requirements.

As shown in Figure 5.8, the Iterative Modulo Scheduler results in a schedule with minimum register requirements in 44.9% of the loops, a surprisingly high percentage for a heuristic that does not attempt to minimize the register requirements. However, the overall register requirements of the Iterative Modulo Scheduler are severely degraded by a few loops that require up to 49 additional registers.

Our next experiment investigates to what extent the stage scheduling approach degrades in loops with large II, since, unlike a modulo scheduler, a stage scheduler can only move operations by multiples of II cycles. For this experiment, we divided the 920 loops successfully scheduled by the MinReg Modulo-Scheduler into 7 distinct subsets, where each subset is associated with a range of II. Figures 5.9 and 5.10 plot,

for each of these subsets, the additional MaxLive that is required if, instead of using the MinReg Modulo-Scheduler, we use the MinReg Stage-Scheduler with the MRTs produced by the Iterative Modulo Scheduler.

The first observation is that, by definition, no loop with II = 1 requires any additional registers, as the scheduling spaces of the stage scheduling and the modulo scheduling approaches are identical for II = 1. We notice, however, that a significant fraction of the loops with larger II also have no additional register requirements. The second observation is that II and the additional register requirements appear to be correlated by the following rule of thumb, which is particularly apparent in Figure 5.10: "A loop with an initiation interval of II > 1 cycles is likely to require no more than II additional registers." In our data set, for example, no loop with II = 2 requires more than 2 additional registers, only 2 loops with II = 3 need more than 3 additional registers, 1 loop with II = 4 requires more than 4 additional registers, 3 loops with II = 5 require more than 5 additional registers, and so on. Recall, however, that these observations are derived from a subset of the 1327-loop benchmark, and that conclusions drawn for this subset may not apply to the larger loops that are omitted here because they required more than 15 minutes of computation time for the MinReg Modulo-Scheduler.

In our last experiment, we illustrate the benefit of precisely modeling the register requirements by investigating the register requirements of the 752 loops successfully scheduled by each of the modulo schedulers, using the structured scheduling constraints and up to 15 minutes of CPU time per loop. Figure 5.11 illustrates the additional register requirements incurred when another scheduler is used instead of the MinReg Modulo-Scheduler.

One interesting observation is that approximating the register requirements by AvgLive, as used by the MinLife Modulo-Scheduler, results in schedules with fewer additional register requirements than approximating the register requirements by Buff, as used by the MinBuff Modulo-Scheduler. For example, the MinLife Modulo-Scheduler finds schedules with no additional register requirements for 85.5% of the 752 loops, and never requires more than 2 additional registers.

[Figure 5.9: fraction of the scheduled loops (Y-axis, 0 to 0.7) versus additional MaxLive (X-axis, 0 to 10), with one curve per II range: II=1 (378 loops), II=2 (230 loops), II=3 (87 loops), II=4-5 (96 loops), II=6-20 (66 loops), II=21-50 (57 loops), and II=50-200 (6 loops).]

Figure 5.9: Additional register requirements of the MinReg Stage-Scheduler relative to the MinReg Modulo-Scheduler in the 920-loop benchmark, broken down by loops in each II range.

[Figure 5.10: fraction of the scheduled loops (Y-axis, 0 to 0.18) versus additional MaxLive (X-axis, 0 to 10), with the same per-II-range curves as Figure 5.9.]

Figure 5.10: Detail of the additional register requirements of the MinReg Stage-Scheduler (blown-up region of Figure 5.9).

[Figure 5.11: fraction of loops scheduled (Y-axis, 0 to 1) versus additional MaxLive (X-axis, 0 to 12), with curves for the MinLife Modulo-Scheduler, MinReg Stage-Scheduler, Heuristic AC+3*UP+RSS, Heuristic 3*UP+RSS (Bounded), MinBuff Modulo-Scheduler, Iterative Modulo Scheduler, and NoObj Modulo-Scheduler.]

Figure 5.11: Additional register requirements relative to the MinReg Modulo-Scheduler in the 752-loop benchmark.

The MinBuff Modulo-Scheduler, however, finds schedules with no additional register requirements in only 65.0% of the 752 loops, and needs up to 10 additional registers. It is also interesting to note that using the MinReg Stage-Scheduler or one of the two stage scheduling heuristics to minimize MaxLive for the MRT selected by the Iterative Modulo Scheduler results in lower register requirements than minimizing buffers (integral MaxLive) among all MRTs.

5.7 Summary

Modulo scheduling is an efficient technique for exploiting instruction level parallelism in a variety of loops. It results in high performance code, but increases the register requirements. As the trend toward higher concurrency continues, whether due to machines with faster clocks and deeper pipelines, wider issue, or a combination of both, the register requirements will increase even more. As

a result, scheduling algorithms that reduce register pressure while scheduling for high throughput are increasingly important.

This chapter presents an approach that schedules loop operations to achieve the minimum register requirements for the smallest possible initiation interval on machines with finite resources and arbitrary reservation tables. It is based on a well-structured integer linear programming formulation that schedules medium-sized loops (i.e. up to 41 operations, II of up to 118 cycles, and up to 5809 variables in the integer linear program) in a reasonable time (i.e. no more than 15 minutes on an HP-9000/715 workstation). Moreover, it precisely models the register requirements of a modulo schedule in each cycle of the schedule. Our contributions are threefold.

First, we have increased the applicability of this approach by handling machines with finite resources and arbitrary reservation tables. The proposed formulation is guaranteed to find a feasible schedule for any machine that satisfies the mapping-free property. We demonstrated this increased modeling capability by precisely handling the resource requirements of the Cydra 5, a machine with complex reservation tables.

Our second contribution is a more structured formulation of the modulo scheduling solution space. This more efficient formulation addresses a major concern with modulo schedulers that are based on integer linear programming solvers [83], which is their traditionally high execution time. Experimental evidence (gathered for a benchmark suite of 1327 loops from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels, compiled for the Cydra 5 machine) indicates that the number of branch-and-bound nodes is decreased on average by a factor of 103 when using the structured instead of the traditional formulation of the MinReg Modulo-Scheduler on the 781 loops successfully scheduled by both formulations. It results, in turn, in a decrease in the total execution time by a factor of 8.6, from 870.2 to 101.0 seconds. Also, using the more efficient representation enabled us to successfully schedule a larger fraction of the 1327 loops; for example, the coverage increases from 58.9 to 69.3% for the MinReg Modulo-Scheduler and from 82.1 to 89.0% for the NoObj Modulo-Scheduler. Using this efficient formulation, the NoObj Modulo-Scheduler succeeded

in scheduling loops with up to 80 operations and with an II of up to 118 cycles, using no more than 15 minutes per loop on an HP-9000/715 workstation.

Our third contribution is a precise formulation of the register requirements. Compared to approximating the register requirements by the average lifetime, minimizing MaxLive results in schedules with lower register requirements for 13.0% of the 1327 loops, and decreases the register requirements by up to 2 registers. Similarly, if the register requirements are approximated by buffers (i.e. if only the integral part of MaxLive is considered), minimizing MaxLive results in schedules with lower register requirements for 21.4% of the 1327 loops, and decreases the register requirements by up to 10 registers.

Furthermore, we used the MinReg Modulo-Scheduler to assess the effectiveness of the stage scheduling algorithms presented in Chapter 4 when used in conjunction with the Iterative Modulo Scheduler. First, our experimental evidence confirms the effectiveness of the Iterative Modulo Scheduler (as proposed and implemented by Rau). We already knew that it achieves MII for 96.0% of the 1327 loops of a benchmark suite from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels, compiled for the Cydra 5 machine. Using the NoObj Modulo-Scheduler, we were able to show that the Iterative Modulo Scheduler finds a schedule with the smallest feasible II for 22 more loops, for a total of 97.7% of the 1327 loops. Among the remaining 2.3% of the loops, the NoObj Modulo-Scheduler succeeded in finding schedules with a lower feasible II for 25.8% of these loops, decreasing their II by up to 2 cycles.

Second, our experimental findings indicate that the stage scheduling approach presented in Chapter 4 achieves most of its possible decrease in register requirements when using the MRTs produced by the Iterative Modulo Scheduler, in a subset of the 1327-loop benchmark that contains all the 920 loops successfully scheduled by the MinReg Modulo-Scheduler using no more than 15 minutes per loop on an HP-9000/715 workstation. Our results indicate that the average register requirements decrease by 22.2% for an optimal modulo scheduler with respect to a register-insensitive modulo scheduler. When using an optimal stage scheduler in conjunction with the Iterative Modulo Scheduler, the average register requirements decrease by 19.9% over the

Also, the average register requirements decrease by 19.8 and 19.2%, respectively, when the stage scheduling heuristics AC+3*UP+RSS with unlimited schedule length and 3*UP+RSS with bounded schedule length are used in conjunction with the Iterative Modulo Scheduler. Thus, stage scheduling appears to provide almost all of the register reduction, with another few percent possible by coupling it with an algorithm that searches for a better MRT than the one provided by the Iterative Modulo Scheduler. However, the search for a better MRT is a very computationally intensive process, with an average time of 13.2 seconds (and up to 15 minutes) for the MinReg Modulo-Scheduler on the 920-loop benchmark, compared, for example, to an average time of 0.4 milliseconds (and up to 12.7 milliseconds) for the AC+3*UP+RSS stage scheduling heuristic. Thus, the use of stage scheduling heuristics is highly recommended for production compilers.


CHAPTER 6

Register Allocation for Predicated Code

Current research compilers for VLIW and superscalar machines focus on exposing more of the inherent parallelism in an application to obtain higher performance by better utilizing wider machines and reducing the schedule length of the code. Since there is generally insufficient parallelism within individual basic blocks, higher levels of parallelism can be obtained by also exploiting the instruction level parallelism among successive basic blocks. A well established approach uses predication to merge several basic blocks into a single enlarged predicated block. This approach relies on predicated operations [48], a class of operations that complete normally when a logical expression, referred to as the predicate of the operation, evaluates to true. Otherwise, the predicated operation is transformed into a no-op and has no side effect.

For scalar code, the hyperblock approach [60][62] used in the IMPACT compiler combines frequently executed basic blocks from multiple execution paths into an enlarged predicated block. This technique enables a range of compiler optimizations and facilitates the task of the scheduler, resulting in greatly improved schedules. For loop code, conditional statements within the innermost loop are merged to enable efficient software pipelining techniques [49][77], which exploit the instruction level parallelism present among the iterations of a loop by overlapping the execution of consecutive loop iterations. IF-conversion [4][72] is the enabling technique used in both scalar and loop code to translate the selected basic blocks into an enlarged predicated block.

However, predicated blocks with high levels of parallelism result in higher register requirements, for two main reasons. First, the register requirements increase as more values are needed to support more concurrent operations. This effect is inherent to parallelism and is exacerbated by wider machines and higher latency pipelines [64]. Second, we show in this chapter that the register requirements increase further for predicated code because current compilers do not allocate registers as well in predicated code as in unpredicated code. Developing compiler techniques that exploit instruction level parallelism while containing the register requirements is crucial to the performance of future machines.

In previous work, a virtual register in a single predicated block is considered live from the cycle in which it is first defined to the cycle in which it is last used, regardless of the predicates under which its def and use operations are guarded. In this chapter, we demonstrate that the register requirements can be decreased if this assumption is relaxed, i.e. if the predicate expressions are taken into account.

The first contribution of this chapter is a framework to analyze predicated code. Information about predicated values is extracted from the code, live ranges and their predicate expressions are discovered, and live range interferences are computed. This framework can process arbitrary predication schemes and arbitrary relations among predicates, including correlations between IF-converted branches. This framework is applicable either before or after scheduling. The current implementation relies on a symbolic package to generate accurate results, using an approach similar to the one taken in the PARADIGM compiler [88].

For register allocators based on the classical graph coloring method, originally proposed by Chaitin [15][18], register allocation for predicated code can be achieved simply by using a refined interference graph instead of the conventional one. However, several register allocators depart from the graph coloring method: e.g. Hendren et al. [46] investigate a framework based on cyclic interval graphs, introducing the notion of time in the register allocation paradigm. This additional notion of time is particularly useful for the live ranges of a loop, where live ranges may cross the boundary of an iteration.

Another approach, investigated by Rau et al. [82], proposes a general framework for the allocation of registers in software pipelined loops for various code generation and hardware support schemes.

The second contribution of this chapter is a set of heuristics that reduces the register requirements by allowing non-interfering virtual registers that overlap in time to share a common virtual register. We refer to this process as the bundling of compatible virtual registers. Bundling is similar to Chaitin-style coalescing [18] in that two or more virtual registers are combined; however, bundling differs in that it combines virtual registers with distinct values. Bundling is performed after constructing the refined interference graph, and before using a conventional register allocation based on either graph coloring, interval graphs, or Rau's software pipelining register allocator. Bundling is also applicable either before or after scheduling.

We investigate the performance of several bundling heuristics in a benchmark suite of modulo scheduled loops extracted from the Perfect Club, SPEC-89, and the Livermore Kernels. To minimize the interaction between bundling and scheduling heuristics, we used the scheduling technique described in Chapter 4, which minimizes the register requirements for a modulo schedule. Preliminary results indicate that our best bundling heuristic decreases the average register requirements from 39.3 to 36.5 registers and increases the percentage of loops needing no more than 64 registers from 85% to 95%. The average register requirements of our best heuristic are within 1% of a predicate-sensitive lower bound. Preliminary results also indicate that bundling compatible virtual registers is more successful when applied after rather than before scheduling.

To our knowledge, the general issue of resource allocation in the context of predicated code has not been addressed in the literature. However, information about predicate values has recently been introduced in research compilers. In the IMPACT compiler, for example, information about predicates is stored in a Predicate Hierarchy Graph (PHG) [60][62] used to determine useful relations among predicate values. This information is used to refine data flow analysis, optimization, scheduling, and allocation in the presence of hyperblocks. Additionally, this information is used to conditionally reserve functional units when modulo scheduling under the Reverse-IF-Conversion scheme [95].

A complementary approach is taken with the Gated Single Assignment (GSA) form [10], where precise predicate information is embedded in the data flow graph. This approach is used in the Polaris parallelizing compiler to refine data and memory dependence analysis and to aid loop parallelization [90].

In this chapter, we illustrate the impact of predicated code on the register requirements and outline our general framework in Section 6.1. The bundling heuristics after scheduling and before scheduling are introduced in Sections 6.2 and 6.3, respectively. We present experimental results in Section 6.4 and conclude in Section 6.5.

6.1 Predication and Live Ranges

In this section, we illustrate why current compilers succeed in accurately determining the live ranges of virtual registers in regular basic blocks, but fail when these basic blocks are converted to a single predicated block. We illustrate this negative impact by investigating the register requirements of a simple conditional statement before and after IF-conversion. We propose a solution that eliminates this negative impact by developing a precise representation of live ranges in single-entry single-exit predicated blocks.

Example 6.1 Our example is based on the simple conditional statement presented in Figure 6.1a. In this example, the value of y is defined by one of two expressions of x, depending on the value of x. Figure 6.1b shows the flow graph, which has four basic blocks. Note that either bb1 or bb2 will be executed, but not both, and they both store their result in virtual register vr3. We assume that there are no further references to any of these virtual registers before the first or after the last basic block.

During the register allocation phase, an interference graph [15][18] is typically constructed to determine the virtual registers that must be assigned to distinct physical registers. An interference graph corresponds to an undirected graph in which the vertices represent the virtual registers and an edge connects two vertices if one is live at the point where the other is defined.

[Figure 6.1: Register requirements of a conditional statement (Example 6.1). Panels: a) conditional statement (if (x>10) y = 3*(x+2); else y = 4*x+5;); b) flow graph and live ranges for each basic block (bb0: vr0 = load x; bb1: vr1 = vr0+2, vr3 = 3*vr1; bb2: vr2 = 4*vr0, vr3 = vr2+5; bb3: store vr3,y); c) interference graph over vr0, vr1, vr2, and vr3.]

To determine the live ranges, global data flow analysis is performed on the flow graph to determine the set of virtual registers that are live-in at the entry point of each basic block, and those that are live-out at the exit point of each basic block. Then, local analysis for each basic block determines the cycle-by-cycle live range of each virtual register that is live-in, live-out, and/or referenced in the basic block [2, pp. 534-535]. This second step is highly machine dependent; we assume in our examples and experiments that a virtual register is reserved in the cycle where its earliest-define operation is scheduled and becomes free in the cycle following its last-use operation. However, our formulation treats the cycles in which a register is reserved and freed as parameters.

Figure 6.1c presents the resulting interference graph. In bb1, for example, vr0 is live when vr1 is defined; therefore, there is an edge between vertices vr0 and vr1 in the interference graph. Notice that there is no edge between vertices vr1 and vr2, since the thread of execution moves exclusively to one of the two basic blocks bb1 and bb2. The resulting interference graph can be colored by no less than 2 colors; consequently, the (minimum) register requirement of this conditional statement is 2.

[Figure 6.2: Register requirements of an IF-converted conditional statement (Example 6.1). Panels: a) predicated block bb0-1-2-3 and live ranges (vr0 = load x; p1(u),p2(ū) = (vr0>10); vr1 = vr0+2 ?p1; vr2 = 4*vr0 ?p2; vr3 = 3*vr1 ?p1; vr3 = vr2+5 ?p2; store vr3,y); b) traditional interference graph; c) refined interference graph.]

We now investigate the impact of predication on the register requirements of Example 6.1. Using IF-conversion, the four basic blocks of Figure 6.1b are merged into the single predicated block of Figure 6.2a. The operations previously executed in bb1 and bb2 are guarded, respectively, by the predicates p1 and p2. The two predicates are defined in the second line by a special operation which sets p1 to true if the comparison succeeds and false otherwise. Similarly, p2 is defined by the same operation to true if the comparison fails and false otherwise. Note that the values of p1 and p2 are disjoint, i.e. p1 and p2 cannot both evaluate to true for any instance of vr0.

An interference graph using traditional live ranges is shown in Figure 6.2b. This interference graph contains an edge between vr1 and vr2 because vr1 and vr2 are considered live simultaneously, as traditional analysis does not examine the predicate expressions. As a result, the register requirements of this predicated block increase from 2 to 3. This increase in register requirements is mainly due to two reasons. First, without knowledge about predicates, the data flow analysis must make conservative assumptions about the side effects of the predicated operations [2]. Second, the solutions of the data flow analysis rely heavily on the connection topology among basic blocks in the flow graph, which is altered by the IF-conversion process.

To eliminate the negative impact of predicated operations on the register requirements of a predicated block, one solution is to maintain the flow graph as predicated blocks are formed. This approach presents the advantage of using existing compiler techniques, but maintaining a consistent data flow graph through the several optimization phases may be complex and time consuming. For example, extracting instruction level parallelism relies heavily on moving operations among predicated blocks and on speculating operations, e.g. by promoting the predicate guarding an operation. Therefore, all optimizations modifying predicates would have to maintain the consistency of the data flow graph at each step.

Another approach is to reconstruct a data flow graph from the predicated blocks each time data flow analysis is needed [60]. This approach eliminates the complexity of maintaining a consistent data flow graph and requires no changes to the classical data flow analysis, traditional optimizations, and register allocation. Additionally, the complexity of this approach is low, resulting in low compile time and good maintainability. However, this approach may produce conservative results. Consider the code of Figure 6.3a, where operation vr4=vr4+1 can be scheduled in any cycle after the load operation. If this operation is scheduled in cycle 2 (or in any cycle before or after all the predicated operations), reconstructing a flow graph results in the topology of Figure 6.3b. When scheduled in cycle 4, however, the graph is as in Figure 6.3d. Determining precise interferences would then require the analysis phase to realize that both conditionals are correlated.

In this chapter, we investigate an approach that directly takes into account the predicate expressions under which virtual registers are live. This approach presents none of the cited drawbacks of the previous approaches; moreover, precise information about the predicate expressions of live ranges may also be extremely useful to other optimizations, such as memory disambiguation [90]. The drawback of this approach is its reliance on a more expensive analysis, partly due to the use of a symbolic package to detect predicate expression disjointness and partly due to the computation of a more complicated interference relation. However, ongoing research on efficient algorithms for precise or approximate analysis of predicated code [51] will significantly improve the execution time of this approach.

[Figure 6.3: Reconstructing a data flow graph. Panels: a) increment operation scheduled in cycle 2; b) equivalent flow graph; c) increment operation scheduled in cycle 4; d) equivalent flow graph, in which the conditional vr0>10 appears twice.]

We now present the mechanisms required by our framework in more detail.

6.1.1 Predicate Extraction

The predicate extraction mechanism allows the compiler to find how predicates are defined and used. This knowledge is expressed as logically invariant expressions, or P-facts, that are guaranteed to hold regardless of the execution trace, the results of comparisons, and the values of other predicates. These P-facts will be used in later phases of the analysis to determine feasible execution paths, i.e. execution paths that do not violate any of the P-facts gathered during this initial extraction phase. Note that failing to extract a P-fact does not compromise the integrity of the analysis; it merely causes more conservative results.

This extraction mechanism is very sensitive to the instruction set architecture, since it relies on the precise semantics of the operations that define the predicate values.

In this work, we adopt the predication scheme of the HPL PlayDoh architecture [54]. Support for predicated execution is provided by a file of 1-bit predicate registers which holds the predicate values, an additional source operand for most operations to specify a predicate register, and a rich set of operations to define values for these predicate registers. Predicate registers are allocated like any other registers in the machine. Each predicated operation is associated with an additional predicate register operand which specifies the predicate value under which the operation may affect the processor state. The semantics of a predicated operation is defined as follows:

Dest = <op>(Src0, Src1) ?P

namely, the operation computes <op> of its source operands and modifies the processor state only if its predicate P evaluates to true. Predicate values are defined by compare-to-predicate operations of the form:

Pout0(<type0>), Pout1(<type1>) = (Src0 <cond> Src1) ?Pin

where <cond> is the comparison being evaluated and <type0> and <type1> specify how the comparison result and the input predicate Pin combine to define the output predicates Pout0 and Pout1. The unconditional types (u and its complement ū) suffice for simple IF-conversion; other predicate types are required to evaluate several comparisons in parallel, as advocated in [87]. For example, the or type may be used to implement a "wired-or" function of several comparisons executed in parallel.

Based on the precise semantics of the predicate type used by a compare-to-predicate operation, we may construct a corresponding logically invariant expression that is guaranteed to hold regardless of the run time program values. Table 6.1 presents the P-facts extracted from two simple code fragments.

Code                      Predicated code                  P-facts
if (i>15) S1; else S2;    p1(u),p2(ū) = (i>15) ?p0;        p1 ⇔ (x1 ∧ p0)
                          S1 ?p1; S2 ?p2;                  p2 ⇔ (¬x1 ∧ p0)
if (i>10) S4;             p4(u) = (i>10) ?p3;              p4 ⇔ (x2 ∧ p3)
if (i>5) S5;              p5(u) = (i>5) ?p3;               p5 ⇔ (x3 ∧ p3)
                          S4 ?p4; S5 ?p5;                  x2 ⇒ x3

Table 6.1: Extracting P-facts.

The first example illustrates a simple predicated if-then-else construct similar to Example 6.1. Following the unconditional type semantics (u type), the first P-fact asserts that p1 is true exactly when both the comparison and the input predicate p0 are true. Following the negated unconditional type semantics (ū type), the second P-fact asserts that p2 is true exactly when the comparison is false and the input predicate p0 is true. P-facts do not make any assumptions about the result of a comparison: P-facts merely assume that a comparison will evaluate either to true or false, resulting in the described effect on the predicate values. Therefore, comparisons are simply abstracted as logical variables; in our first example, the comparison (i>15) is abstracted as the variable x1. The second example in Table 6.1 illustrates two correlated predicates. We notice that the two conditions (i>10) and (i>5) are correlated, since i larger than 10 implies i larger than 5. The third P-fact of this example expresses this correlation.
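To make the extraction concrete, the following sketch (not taken from the dissertation; the sympy-based encoding and all names are ours) expresses the P-facts of Table 6.1 as boolean invariants and uses a satisfiability query to test predicate disjointness, the property that Example 6.1 relies on:

    # Encode the P-facts of Table 6.1 with the sympy symbolic package and
    # check disjointness of predicate pairs via satisfiability.
    from sympy import symbols, Equivalent, Implies, And, Not
    from sympy.logic.inference import satisfiable

    p0, p1, p2, p3, p4, p5, x1, x2, x3 = symbols('p0 p1 p2 p3 p4 p5 x1 x2 x3')

    pfacts = [
        Equivalent(p1, And(x1, p0)),       # p1(u)     = (i>15) ?p0
        Equivalent(p2, And(Not(x1), p0)),  # p2(u-bar) = (i>15) ?p0
        Equivalent(p4, And(x2, p3)),       # p4(u)     = (i>10) ?p3
        Equivalent(p5, And(x3, p3)),       # p5(u)     = (i>5)  ?p3
        Implies(x2, x3),                   # (i>10) implies (i>5)
    ]

    # p1 and p2 are disjoint: no assignment satisfies all P-facts and p1 & p2.
    print(satisfiable(And(*pfacts, p1, p2)))        # -> False
    # p4 and p5 are not disjoint: both hold whenever i > 10.
    print(bool(satisfiable(And(*pfacts, p4, p5))))  # -> True

Since P-facts never assume the outcome of a comparison, the abstracted comparison variables x1, x2, and x3 remain free in these queries.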


6.1.2 Live Range Analysis

In this section, we compute the predicate expression under which a virtual register is live. We consider live ranges defined and used by one or more operations within a single predicated block.

Consider a virtual register vr and an operation i producing a value of vr during the cycle interval tpe_i to tpl_i, namely from the earliest production time to the latest production time at which operation i may write back its result into vr. For example, the earliest production time of a divide operation corresponds to the earliest time at which the divide operation may write its result back (e.g. divide by 1), and the latest production time corresponds to the latest possible time. Consider also an operation j consuming a value of vr during the cycle interval tse_j to tsl_j, namely from the earliest sampling time to the latest sampling time at which operation j may read the value of vr.

In a single unpredicated basic block, a value flows from operation i to operation j only if no intervening operation produces a value for the same virtual register between operations i and j. In a predicated block, however, the predicates under which operations execute must be taken into consideration. A value effectively flows from operation i to j if the predicates of operations i and j both evaluate to true and the predicates of all producer operations that could overwrite that value evaluate to false. The exact condition under which a value flows from operation i to operation j is referred to as the flow condition and is determined by the following conjunction of predicates:

\[
  p_i \wedge p_j \wedge \bigwedge_{\substack{tpe_k > tpl_i \\ tpl_k < tse_j}} \neg p_k
\tag{6.2}\]

where the p_k are the predicates of all the def-operations of vr that may potentially overwrite vr.

Figure 6.4a illustrates a data flow graph with two producer and two consumer operations of virtual register x. The flow condition associated with each edge is also presented in this figure.

[Figure 6.4: Flow and Def components of a live range. Panels: a) data flow graph with two producers (x= ?p0 and x= ?p1) and two consumers (=x ?p2 and =x ?p3) of x, each edge labeled with its flow condition (p0 ∧ ¬p1 ∧ p2, p0 ∧ ¬p1 ∧ p3, p1 ∧ p2, p1 ∧ p3); b) flow components; c) def components (conditions p0 and p1).]

For example, a value effectively flows from the first producer to the last consumer when the condition p0 ∧ ¬p1 ∧ p3 evaluates to true, i.e. when both the first producer and the last consumer operations execute, and the intervening producer operation does not.

We investigate now the contributions to the live range of vr generated by operations i and j. One contribution to the live range of vr is associated with the data flow between operations i and j. This contribution is associated with the cycle range [tpe_i, tsl_j] and is referred to as the flow component; it is live under the flow condition associated with the edge between operations i and j. Figure 6.4b illustrates the flow component for each producer-consumer pair of x. Another contribution is associated with the cycle range [tpe_i, tpl_i]. This contribution is referred to as the def component and is live under the predicate guarding operation i, referred to as the def condition. Note that the def components exist regardless of whether the produced values are used. Figure 6.4c illustrates the def components for the producers of x.
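The flow condition of Equation (6.2) is mechanical to compute once the production and sampling intervals are known. The sketch below is ours, not the thesis' implementation; the Op record simply mirrors the tpe/tpl/tse/tsl notation introduced above:

    # Compute the flow condition of Equation (6.2) for a producer-consumer
    # pair, negating the predicates of all potentially intervening producers.
    from dataclasses import dataclass
    from sympy import And, Not, symbols

    @dataclass
    class Op:
        pred: object   # sympy boolean guarding the operation
        tpe: int = 0   # earliest production time
        tpl: int = 0   # latest production time
        tse: int = 0   # earliest sampling time
        tsl: int = 0   # latest sampling time

    def flow_condition(prod_i, cons_j, producers):
        cond = And(prod_i.pred, cons_j.pred)
        for k in producers:
            # k may intervene if it writes strictly after prod_i's latest
            # writeback and strictly before cons_j's earliest read.
            if k is not prod_i and k.tpe > prod_i.tpl and k.tpl < cons_j.tse:
                cond = And(cond, Not(k.pred))
        return cond

    # Reproducing the Figure 6.4 edge from the first producer to the
    # last consumer of x:
    p0, p1, p3 = symbols('p0 p1 p3')
    prods = [Op(p0, tpe=0, tpl=1), Op(p1, tpe=2, tpl=3)]
    print(flow_condition(prods[0], Op(p3, tse=4, tsl=5), prods))
    # -> p0 & p3 & ~p1 (up to argument ordering)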


[Figure 6.5: Determining interferences. Panels: a) predicated live range of x (producers x= ?p0 and x= ?p1, consumers =x ?p2 and =x ?p3), with two def components marked 1 and 2; b) predicated live range of y (producer y= ?p4, consumer =y ?p5); c) interference conditions: interval 1 interferes if and only if (p0 ∧ ¬p1 ∧ p2 ∧ p4) ∨ (p0 ∧ ¬p1 ∧ p3 ∧ p4), interval 2 if and only if (p1 ∧ p4 ∧ p5).]

6.1.3 Interference

In this section, we compute the interference between two predicated live ranges based on Chaitin's definition: two live ranges interfere if one of them is live at a definition point of the other [18]. This definition implies that overlapping def components do not interfere, provided no operations use these values. In the context of predicated blocks, this definition implies that simultaneous writes to a unique physical register should be tolerated by the hardware of the machine. In this chapter, we assume that simultaneous writes are allowed, resulting in an unspecified value.

Figures 6.5a and 6.5b illustrate the predicated live ranges of virtual registers x and y, respectively. We notice that two def components, marked 1 and 2, each overlap the flow components of the other virtual register. We use the flow and def conditions associated with each component to determine the predicate expression under which x and y may interfere, as shown in Figure 6.5c.

For example, the second def component of x interferes with the flow component of y (interval labeled 2) if there is a legal execution path in which the expression p1 ∧ p4 ∧ p5 evaluates to true. To see if this is possible, the compiler queries whether this expression is consistent with all of the P-facts, which are known to be satisfied by all legal paths.

If the conjunction of the query expression with all the P-facts does not evaluate to false, the compiler must assume that there is some legal path that may satisfy the query expression, i.e. that there may be interference between x and y. Otherwise, the conjunction evaluates to false, indicating that no legal execution path satisfies the query expression, i.e. these particular components of x and y do not interfere.

In this chapter, we consider query expressions of the form p_{i_1} ∧ … ∧ p_{i_{n'}} ∧ ¬p_{i_{n'+1}} ∧ … ∧ ¬p_{i_n}, i.e. a conjunction of predicate terms and negated predicate terms. Consistency between a query expression expr and the P-facts is checked by the function consistent(expr). This function investigates whether the following inequality can be proven:

\[
  \left( \bigwedge_{k=1}^{m} \text{P-fact}_k \right) \wedge (\mathit{expr}) \neq \text{false}
\tag{6.3}\]

returning true when Inequality (6.3) is proven true, and false otherwise. This function can be implemented in several ways; in this work, we use a symbolic package that simplifies logic equations.

An interference algorithm based on Chaitin's definition of interference is presented in Figure 6.6. Note that this algorithm relies on the specific scheduling time of each operation. When interferences are computed before scheduling, we must make a conservative assumption on the scheduling time of each producer and consumer operation.

Function Live-at-def investigates whether any of the def components of x may interfere with any of the flow components of y. These tests are performed in function Def-at-flow. Consider prod-x, prod-y, and cons-y to be, respectively, a producer of x, a producer of y, and a consumer of prod-y. First, Def-at-flow determines if the def component of prod-x overlaps in time with the flow component associated with the producer-consumer pair (prod-y, cons-y). If it does, the flow condition expr is computed, as defined in Section 6.1.2, and the conjunction of the def condition and the flow condition is checked for consistency with the P-facts. When it is consistent, the compiler must assume that prod-x may interfere with the value flowing from prod-y to cons-y and thus consider the live ranges x and y to interfere.

procedure Def-at-flow (Op prod-x, Op prod-y, Op cons-y) {
 1:    /* Determine if prod-x may overwrite the value */
 2:    /* flowing from prod-y to cons-y. */
 3:    if def-component(prod-x) overlaps with
 4:          flow-component(prod-y, cons-y) then
 5:       expr = predicate(prod-y) and predicate(cons-y);
 6:       for each other producer of y strictly in interval(prod-y, cons-y): prod-y' do
 7:          expr = expr and not predicate(prod-y');
 8:       endfor
 9:       return consistent(expr and predicate(prod-x));
10:    endif
11:    return false;
}

procedure Live-at-def (Live Range x, Live Range y) {
12:    /* Determine if the producers of x interfere with live range y. */
13:    for each producer of x: prod-x do
14:       for each producer of y: prod-y do
15:          for each consumer of prod-y: cons-y do
16:             if Def-at-flow(prod-x, prod-y, cons-y) then
17:                return true;
18:             endif
19:          endfor
20:       endfor
21:    endfor
22:    return false;
}

procedure Interference (Live Range x, Live Range y) {
23:    /* Determine if live ranges x and y interfere. */
24:    return Live-at-def(x, y) or Live-at-def(y, x);
}

Figure 6.6: Interference algorithm.

If no def component of x or y is detected to interfere with a flow component of the other, then x and y do not interfere and the edge between them is removed from the interference graph.

As presented in Figure 6.6, the algorithm has a computational complexity of O(n^4), where n is the number of producers and consumers. However, the loop in line 6 can be computed in advance for each live range, thus reducing the complexity of the algorithm to O(n^3). To reduce the cost of detecting live range interferences, a regular interference algorithm can be used first, followed by a predicated analysis phase used to remove spurious interference edges.
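In practice, the consistent() queries issued by Def-at-flow reduce directly to satisfiability over the conjunction of the P-facts and the query expression, as stated in Equation (6.3). A minimal sketch using sympy follows; the global PFACTS list is our simplification (a real implementation would thread the P-facts of the current predicated block through the analysis):

    from sympy import And
    from sympy.logic.inference import satisfiable

    PFACTS = []   # boolean invariants gathered by the predicate extraction phase

    def consistent(expr):
        # (P-fact_1 & ... & P-fact_m) & expr is consistent iff it is
        # satisfiable, i.e. some legal execution path may satisfy expr.
        return satisfiable(And(*PFACTS, expr)) is not False

Because our query expressions are conjunctions of predicate terms and negated predicate terms, cheaper special-purpose decision procedures are possible; a general symbolic package is used here only for simplicity.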

6.2 Post-scheduling Bundling

For register allocators based on Chaitin's graph coloring framework [15][18], register allocation for predicated code can be achieved simply by using the refined interference graph instead of the conventional one. However, several register allocators depart from the graph coloring method [46][82], as graph coloring methods do not provide a notion of time, which is particularly useful for the live ranges of a loop, which may cross the boundary of an iteration. Also, non-traditional constraints such as the ones presented in [82] to support various code generation and hardware support schemes are difficult to express within the graph coloring framework.

To address the allocation of registers in loops with predicated code, we propose a technique that reduces the register requirements by allowing non-interfering virtual registers that overlap in time to share a common virtual register. Once virtual registers have been suitably bundled, a traditional predicate-insensitive register allocator can be used. In this section, we investigate this technique, referred to as bundling, after completion of the scheduling process. The key issue is to determine which virtual registers should be bundled together. We propose a parameterized heuristic that determines advantageous bundlings.

First, the interferences between the virtual registers of a predicated block are computed using the algorithm presented in Section 6.1.3. Second, the virtual registers are placed on a list of unprocessed virtual registers, ordered according to a heuristic. Then, the bundling algorithm selects the first virtual register on that list and assigns it to a suitable bundle. Several selection heuristics can be specified to identify a suitable bundle. A simple heuristic chooses the first bundle in which the current virtual register does not interfere with any of the virtual registers already in the bundle. A more elaborate heuristic would select the bundle that minimizes some cost function. Once a bundle is selected, the current virtual register is added to the bundle and the bundling heuristic selects the next unprocessed virtual register. This algorithm thus maps virtual registers to bundles, as sketched below. The virtual registers in a bundle are guaranteed to be compatible (no pairwise interference), and can be reassigned to a unique virtual register.

The remainder of this section discusses the ordering and the bundle selection heuristics. In this discussion, we define the live range of a virtual register as the interval [tpe_i, tsl_j], where operations i and j are, respectively, the earliest producer and the latest consumer of that virtual register. By extension, we define the live range of a bundle as the union of the live ranges of all virtual registers of that bundle. Furthermore, a virtual register is compatible with a bundle if it does not interfere with any of the virtual registers of that bundle.
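The driver just described can be summarized in a few lines. The sketch below (all helper names are ours) covers the non-restacking selection heuristics, with the ordering and bundle-selection policies passed in as parameters:

    # Map virtual registers to bundles of mutually compatible registers.
    def bundle_all(virtual_regs, order_key, select, interferes):
        worklist = sorted(virtual_regs, key=order_key)
        bundles = []                  # each bundle: a list of virtual registers
        for vr in worklist:
            compatible = [b for b in bundles
                          if not any(interferes(vr, other) for other in b)]
            chosen = select(vr, compatible)
            if chosen is None:        # no suitable bundle: open a new one
                chosen = []
                bundles.append(chosen)
            chosen.append(vr)
        return bundles

The Best Bundle heuristic of Section 6.2.2 additionally evicts interfering virtual registers from the selected bundle and restacks them on the worklist, which this simple driver omits.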

6.2.1 Ordering Heuristics

Arranging the virtual registers in a suitable order is critical for heuristics that consider the bundling of only one virtual register at a time. We present here some of the ordering heuristics we investigated. Ordering heuristics can be combined to break ties.

Start-Time. Arrange the virtual registers by increasing start-time of their live ranges. This heuristic tries to capture the time locality in the dependence graph.


Decreasing-Length. Arrange the virtual registers by decreasing length of their live ranges. Bundling virtual registers with long live ranges first is expected to result in a larger decrease in register requirements.

Decreasing-Interference. Arrange the virtual registers by decreasing number of interferences. Bundling virtual registers that interfere with the largest number of virtual registers first is expected to result in a larger decrease in register requirements.
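Expressed as sort keys for the driver above (a sketch; the vr fields are our assumptions), these orderings compose naturally, with Python's tuple comparison providing the tie-breaking combinations used later in Section 6.4:

    def start_time_key(vr):                  # Start-Time
        return vr.start

    def decreasing_length_key(vr):           # Decreasing-Length
        return -(vr.end - vr.start)

    def length_then_interference_key(vr):    # Decreasing-Length, with
        return (-(vr.end - vr.start),        # Decreasing-Interference
                -vr.num_interferences)       # for tie-breaking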

6.2.2 Selection Heuristics

The selection heuristics place the current virtual register in a suitable bundle. When a selection heuristic tries every bundle unsuccessfully, a new bundle is created at the end of the bundle list and the current virtual register is added to it.

First-Compatible Bundle. Choose the first bundle for which the current virtual register is compatible and their live ranges intersect.

Last-Compatible Bundle. Choose the last bundle for which the current virtual register is compatible and their live ranges intersect. This heuristic presumes that more recently created bundles are more likely to accept new virtual registers than older bundles.

Best-Compatible Bundle. Choose the best bundle in which the current virtual register is compatible and their live ranges intersect. The best bundle is the one that maximizes the intersection of the current virtual register's live range and the bundle's live range (an empty intersection is not considered).

Best Bundle. Choose the best bundle in which the current virtual register is partially compatible and their live ranges intersect. This heuristic first computes, for each bundle, the set of virtual registers that interfere (inter) and the set of virtual registers that do not interfere (comp) with the current virtual register (vr). Then, it computes the number of cycles (vr_in) in which the live ranges of comp and the live range of vr intersect. It also computes the number of cycles (vr_out) in which one or more live ranges of comp and the live ranges of inter intersect. The best bundle is the one that maximizes vr_in - vr_out (values ≤ 0 are not considered). If a bundle is selected, the virtual registers in inter are removed from the bundle and restacked on the list of unprocessed virtual registers. This selection heuristic is guaranteed to complete because virtual registers are restacked only if the overall benefit strictly increases (note that the maximum possible overall benefit is finite).
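The vr_in - vr_out benefit is easy to compute when live ranges are represented as sets of cycles. The following sketch assumes (our convention, not the thesis') that each virtual register carries a live set of cycles and that interferes() implements the refined test of Section 6.1.3:

    def best_bundle_gain(vr, bundle, interferes):
        comp  = [u for u in bundle if not interferes(vr, u)]
        inter = [u for u in bundle if interferes(vr, u)]
        comp_live  = set().union(*(u.live for u in comp))
        inter_live = set().union(*(u.live for u in inter))
        vr_in  = len(vr.live & comp_live)      # overlap with compatible registers
        vr_out = len(comp_live & inter_live)   # overlap lost by evicting inter
        return vr_in - vr_out                  # select the bundle maximizing this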

6.3 Pre-scheduling Bundling

We investigate in this section the process of bundling compatible live ranges before scheduling. The motivation for bundling before scheduling is that a register-sensitive scheduler may better schedule the operations of a bundle if it is aware of which virtual registers belong to the bundle.

With pre-scheduling bundling heuristics, interferences between the virtual registers of a predicated block cannot be computed as precisely as in Section 6.1.3, since the scheduling time of the operations is unknown. Instead, we must rely on the earliest and latest feasible scheduling time of an operation [43][49]. In our current implementation, the pre-scheduling bundling heuristics rely on a weaker form of interference where two virtual registers are considered to interfere if any two of their producer operations do not execute under mutually disjoint predicates.

A more elaborate approach might alleviate this restriction by adding scheduling edges to the dependence graph to enable advantageous bundles. In this scheme, additional scheduling edges would supply some further temporal constraints on the subsequent scheduling. Some edges are guaranteed not to actually constrain the scheduler, as shown by Pinter [75]. Others, however, may constrain the scheduler. In the context of modulo scheduled loops, the danger of this approach resides in the fact that adding scheduling edges may, if not carefully applied, overly constrain the schedule of predicated blocks and may thus result in schedules with increased schedule length or decreased throughput. For this reason, we did not investigate this approach.
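The weaker interference test that our current implementation does use needs only the producers' predicates and the consistent() query of Section 6.1.3; a sketch (field names are ours):

    from sympy import And

    def weak_interferes(vr_x, vr_y):
        # Interfere unless every pair of producers is provably disjoint
        # under the P-facts.
        return any(consistent(And(px.pred, py.pred))
                   for px in vr_x.producers
                   for py in vr_y.producers)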

6.4 Measurements

We investigated the register requirements of the integer and floating point register files for a subset of the 1327-loop benchmark suite as compiled for the Cydra 5 and presented in Section 2.7. From the 1327-loop benchmark suite, we selected all the loops with at least one pair of mutually disjoint predicate values, resulting in a benchmark of 21 loops. In the current implementation, we did not attempt to correlate any of the IF-converted branches.

Table 6.2 summarizes the principal characteristics of the benchmark suite. The loops in the benchmark suite are significantly more complex than the average loop in the input set, with 51.43 operations and 61.62 register edges on average, versus 17.54 and 22.30 for the input set, respectively, as reported in Table 4.2. As a result of this bias, the loops of the benchmark suite require an average of 39.3 registers, versus 27.9 in the input set, when using the scheduling technique described in Chapter 4 which minimizes the register requirements for a modulo schedule. As the suite contains loops with larger register requirements, decreasing the register requirements of these loops may be crucial.

Measurements:                    min    freq   median  average     max
number of operations           11.00    4.8%    50.00    51.43   107.00
register edges                  9.00    4.8%    62.00    61.62   132.00
scheduling edges               12.00    4.8%    66.00    69.48   155.00
II                              4.00    4.8%    16.00    26.86   109.00
number of predicate values      2.00   50.0%     2.00     3.05     9.00
disjoint pairs of predicates    1.00   72.7%     1.00     1.95    13.00

Table 6.2: Characteristics of the 21-loop benchmark suite.

In this chapter, we used the MinReg Stage-Scheduler and the Iterative Modulo Scheduler to produce modulo schedules with high performance and low register requirements.


[Figure 6.7: Register requirements for the 21-loop benchmark suite. The plot shows the fraction of loops scheduled (y-axis) versus the number of registers (x-axis, 8 to 88) for four curves: Schedule Independent Lower Bound, Best Post-Scheduling Bundling Heuristic, MinReg Stage Scheduler, and Iterative Modulo Scheduler.]

We first investigate the register requirements of these two schedulers and the register requirements of our best bundling heuristic, which corresponds to a post-scheduling bundling heuristic using the Decreasing-Length ordering (with Decreasing-Interference for tie-breaking) and the Best-Bundle selection. Figure 6.7 presents the fraction of the loops in the benchmark suite that can be scheduled for a machine with any given number of registers without spilling and without increasing II. The "Schedule Independent Lower Bound" curve corresponds to [49] and is representative of the register requirements of a machine without any resource conflicts, but does not support bundling, i.e. it ignores the possible decrease in register requirements due to the bundling of compatible virtual registers. Notice that none of the schedulers, including our best bundling heuristic, approaches the register requirements of the Schedule Independent Lower Bound. This large gap is partly explained by the strong simplification (no resource conflicts) of the lower bound, a gap that is particularly apparent for large loops and negligible for small loops [36].


The second and third curves illustrate the decrease in register requirements due to the bundling of compatible virtual registers when applied to the schedules produced by the MinReg Stage Scheduler. Using our best bundling heuristic, the average register requirements decrease from 39.3 to 36.5 registers and the number of loops needing no more than 64 registers increases from 18 to 20 of the 21 loops. Note that this improvement is obtained only among schedules that share the given MRT. The lowest curve illustrates the register requirements associated with the Iterative Modulo Scheduler and is indicative of the register requirements of a good modulo scheduler that currently does not attempt to minimize the register requirements.

To investigate the performance of the bundling heuristics, we introduce a lower bound on the register requirements:

Predicate-Sensitive Lower Bound. A lower bound on the register requirements for a given loop schedule and set of P-facts is obtained by bundling the virtual registers on a cycle-by-cycle basis. This clearly represents a lower bound, as normally virtual registers are allowed to be bundled on a register-by-register basis only.
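One way to compute this bound is to pack, independently at each cycle, the values live in that cycle into groups of mutually non-interfering values; the greedy packing below is our assumption about one reasonable implementation, not the dissertation's code:

    def predicate_sensitive_lower_bound(virtual_regs, interferes, num_cycles):
        bound = 0
        for cycle in range(num_cycles):
            live = [vr for vr in virtual_regs if cycle in vr.live]
            groups = []                 # groups of mutually compatible values
            for vr in live:
                for g in groups:
                    if not any(interferes(vr, u) for u in g):
                        g.append(vr)
                        break
                else:
                    groups.append([vr])
            bound = max(bound, len(groups))
        return bound

Because the packing is recomputed independently at every cycle, a value may land in different groups at different cycles, which is precisely why this bound may be unattainable by any register-by-register bundling.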

We focus here on the decrease in register requirements achieved by a heuristic, normalized to the decrease in register requirements from no bundling to the Predicate-Sensitive Lower Bound. The schedule used to investigate the post-scheduling bundling heuristics was obtained by the MinReg Stage Scheduler. In our benchmark suite, the average register requirement without bundling was 39.3 registers and the Predicate-Sensitive Lower Bound was 36.2 registers. In the following paragraphs, a decrease in register requirements of 100% corresponds to a bundling heuristic that achieves the lower bound and 0% corresponds to a bundling heuristic that does not decrease the register requirements at all. All numbers are averaged over the 21 loops of the benchmark suite.

We first investigate the performance of the bundle selection heuristics presented in Section 6.2.2. Table 6.3 presents the normalized decrease in register requirements associated with each of the four selection heuristics, averaged over all loops and ordering heuristics. Note that the last two selection heuristics require a cost function to evaluate the intersections of the virtual register live range and the bundle live range.

                                   Cost Function
Bundle Selection Heuristic     none    intersection
First-Compatible                72%         -
Last-Compatible                 75%         -
Best-Compatible                  -         72%
Best                             -         90%

Table 6.3: Normalized decrease in register requirements for the selection heuristics.

We may conclude from Table 6.3 that the Last-Compatible Bundle heuristic is the better selection heuristic without a cost function, and that the Best Bundle heuristic, which is the least sensitive to the ordering of the virtual registers, is the best selection heuristic overall.

                               Cost Function
Ordering Heuristic          none    intersection
Reference                    61%        76%
Start-Time                   80%        85%
Length                       85%        86%
Interference                 76%        84%
Length+Interference          85%        87%

Table 6.4: Normalized decrease in register requirements for the ordering heuristics.

The performance of the ordering heuristics described in Section 6.2.1 is summarized in Table 6.4. Each table entry corresponds to the normalized decrease in register requirements achieved by an ordering heuristic, averaged over all loops and over the selection heuristics that share the same cost function. The first row, labeled "Reference", corresponds to the decrease in register requirements averaged over the initial and reverse orderings. Comparing the numbers of the other ordering heuristics to the reference numbers provides insight on the achieved decrease in register requirements due to the ordering heuristics.

We conclude from Table 6.4 that the ordering heuristics do affect the register requirements, and that the largest decrease is generated by the Length+Interference ordering heuristic, which uses Decreasing-Length with Decreasing-Interference for tie-breaking.

The individual results confirm the above findings, with the best results for the Best/Length+Interference, Best/Length, and Best/Start-Time heuristics, decreasing the average register requirements from 39.3 to 36.52 registers, followed by Best/Interference and Last-Compatible/Length, decreasing the average register requirements from 39.3 to 36.57 registers. All five of these post-scheduling bundling heuristics are less than 1% above the Predicate-Sensitive Lower Bound.

[Figure 6.8: Additional register requirements of the pre- and post-scheduling bundling heuristics relative to the Predicate-Sensitive Lower Bound. The plot shows the fraction of loops scheduled (y-axis) versus the number of additional registers (x-axis, 0 to 30) for five curves: post-scheduling bundling heuristic Best/Length+Interference, post-scheduling bundling heuristic Last-Compatible/Length, pre-scheduling bundling heuristic Best/Length, MinReg Stage Scheduler, and Iterative Modulo Scheduler.]

Figure 6.8 illustrates the number of additional registers needed by the bundling heuristics and other scheduling algorithms when compared to the Predicate-Sensitive Lower Bound.

From top to bottom, the curves represent the additional registers needed by the two post-scheduling bundling heuristics, the pre-scheduling heuristic, the MinReg Stage Scheduler, and the Iterative Modulo Scheduler. Notice that the two post-scheduling bundling heuristics reduce the register requirements significantly, with all loops scheduled using no more than 2 additional registers above the possibly unattainable Predicate-Sensitive Lower Bound. The decrease in register requirements of the pre-scheduling bundling heuristic is significantly smaller, with all loops scheduled using no more than 8 additional registers. This disappointing performance is due, we believe, to the fact that our pre-scheduling bundling approach uses a more conservative definition of non-interfering live ranges. A more elaborate approach could alleviate this situation by adding scheduling edges to the dependence graph to enable advantageous bundling, at the risk of increasing the schedule length or decreasing the throughput of the schedule. In its current form, however, experimental evidence suggests that the bundling of compatible virtual registers, without adding scheduling edges to the dependence graph, is more successful as a post-scheduling optimization phase than as a pre-scheduling phase.

6.5 Summary

In this chapter, we discussed the need for better analysis of predicated code by showing an example of code where the register requirements increase when several basic blocks are merged into a single predicated block. To remedy this increase in register requirements, we proposed a framework that analyzes predicated code by extracting P-facts from the predicated block, analyzing the predicate expressions associated with live ranges, and refining Chaitin's interference definition to account for predicate expressions. This framework generalizes to arbitrary predication schemes and may capture correlations between IF-converted branches.

Register allocators based on the graph coloring framework can simply use the refined interference graph to perform register allocation on predicated code. For register allocators based on interval graphs, or others, we propose a new technique that bundles non-interfering virtual registers to lower the register requirements of a predicated block. Bundling may be performed after the scheduling process to capitalize on

the availability of the actual scheduling time of each operation to determine precisely the interferences among virtual registers. Alternatively, bundling before scheduling capitalizes on the fact that a register-sensitive scheduler may better schedule the operations associated with a pre-bundled virtual register. Results using our best post-scheduling bundling heuristic decreased the average register requirements from 39.3 to 36.5 registers, just 1% above the Predicate-Sensitive Lower Bound, and increased the number of loops with no more than 64 registers from 18 to 20, in a benchmark suite of 21 modulo scheduled loops extracted from the Perfect Club, SPEC-89, and the Livermore Kernels. These results also suggest that bundling compatible virtual registers, without adding scheduling edges in the dependence graph, is more successful when applied after rather than before scheduling.


CHAPTER 7

Conclusions

7.1 Summary of Contributions

High performance for VLIW and superscalar processors is achieved by exposing the inherent parallelism in an application and capitalizing on the exposed instruction level parallelism to achieve an efficient instruction schedule that utilizes the machine well, avoids resource conflicts, and minimizes unmasked stall cycles. To better utilize machine resources and reduce or prevent resource contentions, high performance compilers use precisely detailed microarchitecture representations. Efficient representations are important since high performance compilers spend a significant amount of compilation time scheduling operations, and thus testing for potential resource contentions. For example, our measurements indicate that approximately half the scheduling time is spent in the contention query module when using a state-of-the-art modulo scheduler on a benchmark suite of 1327 loops from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels compiled for the Cydra 5.

We addressed this issue in the context of high-performance compilers by proposing an efficient contention query module that supports the elaborate scheduling techniques used by today's high performance compilers. Our module is based on a reduced machine description that results in significantly faster detection of resource contentions while exactly preserving the scheduling constraints present in the original machine description. This approach achieves two major goals.

First, it handles queries significantly faster: for example, dynamic measurements obtained when scheduling the 1327-loop benchmark for the Cydra 5 using the Iterative Modulo Scheduler [80] indicate an average decrease in work units (i.e. the essential work performed by the contention queries) by a factor of 2.76, resulting in queries that complete on average in 58.2% of the time when a reduced machine description is used instead of the original machine representation. Second, the reduced machine description approach does not limit the functionality of the contention query module. Additional functionality that is fully supported here, and used in our experiments, allows scheduling an operation earlier than others that are already scheduled, unscheduling operations due to resource contentions, and efficient handling of the periodic resource requirements found in software pipelined schedules. Thus, by using our reduced machine description, the computational requirements of these scheduling algorithms, and other similar algorithms, should decrease as they would no longer be penalized by less efficient machine resource models.

While most machine resources can be efficiently modeled using contention query modules, other machine resources, such as registers, are typically not handled by such modules. In the case of registers, this problem is further aggravated by the fact that schedules exhibiting high concurrency generally result in higher register requirements as well. We addressed this issue by investigating register-sensitive schedulers in the context of software pipelined loops. Our approach combines optimal algorithms to gain understanding and fast heuristics to improve compiler technology.

We first investigated an optimal algorithm for modulo scheduling (the MinReg Modulo-Scheduler) that schedules loop operations to achieve the minimum register requirements for the smallest possible initiation interval. Our approach has extended the applicability of optimal modulo scheduling algorithms to machines with finite resources and arbitrary reservation tables. We demonstrated this increased modeling capability by precisely handling the resource requirements of the Cydra 5, a machine with complex reservation tables.

Our approach also uses a well structured formulation of the modulo scheduling solution space, which significantly reduces the traditionally high execution time of modulo schedulers based on integer linear programming formulations.

Experimental evidence with the 1327-loop benchmark indicates an average decrease in the number of branch-and-bound nodes (visited by the solver) by a factor of 103, resulting in an average decrease in execution time by a factor of 8.6, when using the structured formulation instead of the traditional formulation of the MinReg Modulo-Scheduler on the 781 loops successfully scheduled by both formulations using no more than 15 minutes per loop. Also, using the more efficient representation enabled us to successfully find a minimum MaxLive schedule in no more than 15 minutes for more loops (increasing from 781 to 920 loops) and larger loops (increasing from a maximum of 25 to a maximum of 41 operations).

Furthermore, our approach fully capitalizes on the strength of optimal solvers by minimizing a precise formulation of the actual register requirements. Compared to approximating the register requirements by the average lifetime, the most effective approximation that we found in the literature, minimizing MaxLive results in schedules with lower register requirements for 13.0% of the 1327 loops. Similarly, compared to approximating the register requirements by buffers, minimizing MaxLive results in schedules with lower register requirements for 21.4% of the 1327 loops. Consequently, modulo schedulers based on integer linear programming formulations should be more applicable (because of their reduced computational complexity and their accurate characterization of the register requirements for machines with arbitrary reservation tables) to situations where achieving a schedule with the minimum register requirements for the smallest possible initiation interval is highly desirable.

For situations where the computational complexity of this approach exceeds practical considerations (e.g. for the 11.0% of the 1327 loops for which the NoObj Modulo-Scheduler does not complete in less than 15 minutes), or for situations where a schedule must be found quickly, we investigated an alternative approach to register-sensitive modulo scheduling with far lower computational complexity.

Our second contribution to register-sensitive modulo scheduling is a technique, referred to as stage scheduling, that reduces the register requirements of a given MRT by shifting operations by integer multiples of II cycles.

The major advantages of the stage scheduling approach are its reduced complexity relative to the full scheduling problem above and the fact that lower register requirements are achieved without impacting the steady-state performance of the original modulo schedule. Furthermore, by bounding the maximum schedule length of an iteration to be no more than that of the original schedule, a significant decrease in register requirements can be achieved with no impact on the transient performance of the schedules as well. We contribute an optimal stage scheduling algorithm (the MinReg Stage-Scheduler) that finds a schedule with minimum register requirements among all stage schedules that share a given MRT. We also contribute a set of stage scheduling heuristics that closely approximate the minimum solution at linear, or at most quadratic, computational complexity (provided that the MinDist relation is given as input).

Our results indicate that for all the 920 loops successfully scheduled by the MinReg Modulo-Scheduler in no more than 15 minutes per loop, the average register requirements are decreased by 22.2% by using the optimal modulo scheduler rather than a register-insensitive modulo scheduler. When using the optimal stage scheduler in conjunction with the register-insensitive Iterative Modulo Scheduler [80], the average register requirements decrease by 19.9%. Moreover, the average register requirements decrease by 19.8 and 19.2% when our best stage scheduling heuristics with unlimited schedule length and bounded schedule length, respectively, are employed. Interestingly, even the best bounded stage scheduling heuristic (using the MRTs produced by the register-insensitive Iterative Modulo Scheduler) results in lower average register requirements than approximating the actual register requirements by buffers and searching for a minimum buffer schedule among all MRTs.

Thus, for situations where the computational complexity of the MinReg Modulo-Scheduler exceeds practical considerations, we recommend the following approach to achieve a high-performance, low register requirements modulo schedule. First, search for a high-performance schedule using a state-of-the-art modulo scheduler, such as Rau's Iterative Modulo Scheduler. The level of performance achieved by this scheduler is remarkable: it was known to result in the minimum initiation interval for 96.0% of the 1327 loops.

Furthermore, we have shown that it actually achieves the minimum feasible initiation interval for 22 more loops, thus resulting in optimal throughput for 1296 (or 97.7%) of the 1327 loops. Note also that this result is achieved for the Cydra 5, a machine with very complex resource requirements. Second, decrease the register requirements, if desired, using a hierarchy of stage scheduling heuristics. As shown in Chapter 4, linear time stage scheduling heuristics achieve a respectable decrease in register requirements, and should suffice in most cases. If a larger decrease in register requirements is desirable, our best stage scheduling heuristics can be employed. If further reductions in register requirements are needed, we can compute a lower bound on the register requirements, as proposed by Hu [49], to estimate the likelihood of finding a schedule with lower register requirements by searching for either a new MRT-schedule, a minimum MaxLive stage schedule, or a combination of both. Increasing II or introducing spill code may also be required for loops with extreme register requirements or machines with small register files.

The resulting schedule should have significantly reduced register requirements, while preserving the steady-state performance of the initial register-insensitive schedule, since II is preserved during the stage scheduling process. Furthermore, the transient performance can also be preserved if the schedule length is bounded to be no more than that of the initial schedule. As a result, the approach outlined here should result in a high throughput, low register requirements schedule that is close to the optimal solution, since the Iterative Modulo Scheduler achieves the minimum feasible II for 97.7% of the 1327 loops, and the unbounded and bounded best stage-scheduling heuristics achieve minimum MaxLive among all stage schedules that share the given MRT for, respectively, 95.9% and 91.6% of the 1327 loops.

Finally, the register requirements can be further decreased in predicated code by computing the interferences among virtual registers in the presence of predicated operations and allowing non-interfering virtual registers to share a physical register even when their lifetimes overlap.


7.2 Future Research

The work presented in this dissertation can be extended in numerous ways. First, the work on reduced machine representation indicates that generating a synthesized machine with fewer resources and resource usages significantly decreases the work units performed by the queries. Extending and broadening this result for machines with large numbers of alternative operations would be extremely interesting. One approach would be to factor subsets of resource usages among alternatives such that the query module can more quickly determine if none of the alternatives may be scheduled in a given cycle. Another approach would be to use counters to keep track of resources with multiple instances that are indistinguishable (i.e. each operation can use any of the resource instances); a small sketch of such a counter-based check appears below. Counters may be integrated nicely into an extension of the bit-vector scheme, if desired. Also, when alternative resources are not completely indistinguishable, there may be some mixture of counters and synthesized resources that can efficiently represent such resources. Another direction of research for efficient representation of machine resources is to extend the finite state automaton approach to handle some of the functionality that is now only provided by the reservation table approach. For example, there may be an elegant way to implement the functionality of the assign&free function within the finite state automaton approach. A third direction of research is to investigate whether, in the presence of multiple alternatives, some knowledge can be integrated into the finite state automaton approach. For example, if in a given state of the automaton one alternative is frequently selected, but also frequently unscheduled due to resource contention, there may be a way to incorporate this knowledge into the automaton by reducing the priority of that alternative relative to others.

The work on optimal modulo schedulers can also be extended to cover a wider range of machines, such as machines with clusters of functional units. It can also be extended to consider optimal generation of spill code, as proposed for straight-line code by Meleis and Davidson [66]. Beyond extending the functionality of the formulation, finding more efficient formulations would be very useful.
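As a minimal illustration of the counter idea above (hypothetical code, not from this dissertation; the names and the per-cycle wrap-around are assumptions), one counter per cycle of the modulo reservation table suffices when all instances of a resource class are interchangeable:

```python
# Sketch: counter-based tracking of indistinguishable resource instances,
# with usage wrapping modulo II for software-pipelined schedules.

class CounterPool:
    def __init__(self, num_instances, ii):
        self.num_instances = num_instances  # e.g. 2 identical adders
        self.ii = ii                        # initiation interval
        self.used = [0] * ii                # usage counter per MRT row

    def can_schedule(self, cycle, usage_offsets):
        # usage_offsets: cycles, relative to issue, at which the operation
        # occupies one instance of this resource class
        return all(self.used[(cycle + d) % self.ii] < self.num_instances
                   for d in usage_offsets)

    def schedule(self, cycle, usage_offsets):
        for d in usage_offsets:
            self.used[(cycle + d) % self.ii] += 1

    def unschedule(self, cycle, usage_offsets):
        for d in usage_offsets:
            self.used[(cycle + d) % self.ii] -= 1

# Two identical adders, II = 3; each add occupies an adder for one cycle.
adders = CounterPool(num_instances=2, ii=3)
for cycle in (0, 3, 6):                 # all three map to MRT row 0
    if adders.can_schedule(cycle, [0]):
        adders.schedule(cycle, [0])
    else:
        print("resource conflict at MRT row", cycle % 3)
```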

There are also numerous directions of research for efficient modulo scheduling heuristics, such as integrating scheduling and spilling heuristics, generating modulo schedules that include only some of the basic blocks found within a loop iteration, or integrating unrolling before scheduling with the unrolling required by modulo variable expansion. Also interesting would be to extend modulo scheduling heuristics to efficiently handle machines with clusters of functional units, perhaps along the lines of the scheduling technique proposed by Ellis [41] for straight-line code.


APPENDIX A

Completion of Proof of Theorem 4.1

We demonstrate here that the stage scheduling problem for minimum integral MaxLive can be solved with a linear programming (LP) solver. In particular, we show that if an LP-solver finds an optimal solution, the optimal solution is integral. In the first step of the demonstration, we reformulate the stage scheduling problem as defined in Algorithm 4.1 and Figure 4.7. We express here the stage scheduling problem using the traditional integer linear programming formulation:

$$z_{ILP} = \min\{\, cx : Ax \ge b,\ x \in \mathbb{Z}_+^n \,\} \qquad (A.1)$$

namely, we formulate the constraint matrix A and the vector b that define the search space, and we express the vector c that defines the objective function for the minimum integral MaxLive stage scheduling problem. Note that Equation (A.1) defines an integer linear programming problem since the solution vector x explicitly belongs to Z_+^n, i.e. x is an n-component vector of nonnegative integers. This first step is developed in Section A.1. In the second step of the proof, we demonstrate that the matrix A and vectors b and c have properties that ensure that the solution of the LP-relaxation of Equation (A.1),

$$z_{LP} = \min\{\, cx : Ax \ge b,\ x \in \mathbb{R}_+^n \,\} \qquad (A.2)$$

which permits nonnegative reals, does in fact result in an integer value for z_LP, i.e. the z_LP and z_ILP solutions are both integer and have the same cost, when an LP-solver is successfully used for z_LP. Note that since the solution vector x in Equation (A.2) belongs to R_+^n instead of Z_+^n, an LP-solver can be used to solve Equation (A.2), as presented in Section A.2.
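As a one-line illustration of why this integrality property is not automatic (a generic textbook-style example, not taken from the dissertation), consider

$$z_{ILP} = \min\{\, x_1 : 2x_1 \ge 1,\ x_1 \in \mathbb{Z}_+ \,\} = 1, \qquad z_{LP} = \min\{\, x_1 : 2x_1 \ge 1,\ x_1 \in \mathbb{R}_+ \,\} = \tfrac{1}{2}.$$

Here the LP-relaxation is fractional; the remainder of this appendix shows that the structure of the stage-scheduling constraint matrix rules out such a gap.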

A.1 Formalization of Stage-Scheduling Problem

As defined in Section 4.3, the minimum integral MaxLive stage-scheduling problem has the following objective function:

minimize:

$$Bu = \sum_{i \in V} b_i \qquad (A.3)$$

subject to the following constraints:

$$\sum_{(i,j) \in u} sign_{i,j}(u)\, p_{i,j} = -\omega_u \qquad \forall u \in U_G \qquad (A.4)$$

$$b_i - p_{i,j} \ge \theta_{i,j} \qquad \forall (i,j) \in E_{reg} \qquad (A.5)$$

$$p_{i,j} \ge 0 \qquad \forall (i,j) \in E_{sched} \cup E_{reg} \qquad (A.6)$$

$$b_i \ge 0 \qquad \forall i \in V \qquad (A.7)$$

for the dependence graph G = {V, E_sched, E_reg}. In this formulation, b_i and p_{i,j} are integer variables, and ω_u and θ_{i,j} are integer constants that correspond to the constant terms of Equation (4.6) and Inequality (4.12), respectively. Recall that U_G is the set of all elementary cycles in the underlying graph of graph {V, E_sched ∪ E_reg}. Remember also that sign_{i,j}(u) is defined as +1 if edge (i,j) is traversed in the forward direction in the cycle u, −1 if edge (i,j) is traversed in the backward direction, and 0 otherwise.

As formulated, the above problem formulation is not exclusively of the type "Ax ≥ b" because of Equation (A.4), which corresponds to the scheduling constraints generated by the underlying cycles. However, we may successively eliminate each of these constraints by isolating one of its p_{i,j} variables per equation and substituting for that variable in the remaining relations of the system.

To illustrate the effect of this substitution, let us consider an elementary cycle, u, in the underlying graph of graph G_1 = {V, E_sched ∪ E_reg}. When isolating the variable p_{i,j} associated with edge (i,j) ∈ u in the instance of Equation (A.4) that corresponds to u, we obtain the following equation:

$$p_{i,j} = sign_{i,j}(u) \left( -\sum_{\substack{(x,y) \in u \\ (x,y) \ne (i,j)}} sign_{x,y}(u)\, p_{x,y} \;-\; \omega_u \right) \qquad (A.8)$$

Since the edges of the underlying cycle u form a closed path, the edges (x,y) that satisfy (x,y) ∈ u and (x,y) ≠ (i,j) precisely correspond to the edges of the alternate path from vertex i to vertex j along underlying cycle u. Furthermore, the −sign_{x,y}(u) factors correspond to the direction in which the edges (x,y) are traversed in the alternate path from vertices i to j when sign_{i,j}(u) is positive; and when sign_{i,j}(u) is negative, we may simply redefine the closed path u to traverse the same underlying cycle in the reverse direction, thus resulting in sign_{i,j}(u) = 1. When the right-hand side of Equation (A.8) is used in place of p_{i,j}, we have eliminated all occurrences of p_{i,j} in the system and replaced them with a linear combination of the p_{x,y} that are associated with the edges of the alternate path from vertices i to j along the underlying cycle u.

Since there is one instance of Equation (A.4) per underlying cycle, we may successively perform one substitution per underlying cycle. Instead of arbitrarily isolating and removing one of the p_{i,j} variables from each cycle, we compute ahead of time an arbitrary spanning tree, T_1, of the graph G_1 and only substitute for those p_{i,j} variables that correspond to edges not in T_1. This substitution policy is feasible because there is a one-to-one mapping between the underlying cycles of G_1 and the edges of G_1 that are not in the spanning tree T_1, as explained below by the second property of spanning trees. The first property of a spanning tree T of graph G is that T defines a unique path between any two connected vertices. We refer here to the path in the spanning tree T between vertices i and j as spath_{i,j}(T). The second property of a spanning tree is that each edge (i,j) that is not in T defines a unique cycle in G, which consists of edge (i,j) and the edges in spath_{i,j}(T).

By eliminating each equality in the system in turn, we eventually obtain a system of inequalities of the type "Ax ≥ b." By using the spanning tree T_1 to decide which p_{i,j} variables to eliminate, the final system contains only those p_{i,j} variables that correspond to edges in the spanning tree. Moreover, we can show by induction that each of the eliminated p_{i,j} is substituted for in the final system by a function of the p_{x,y} variables associated with the edges (x,y) ∈ spath_{i,j}(T_1), i.e. the edges in the unique path in the spanning tree from vertices i to j. Similarly, we may also show that the coefficient in front of each of these p_{x,y} variables is precisely sign_{x,y}(spath_{i,j}(T_1)). Thus, the final system of inequalities can be formulated as follows:

minimize:

$$Bu = \sum_{i \in V} b_i \qquad (A.9)$$

subject to:

$$b_i - \sum_{(x,y) \in spath_{i,j}(T_1)} sign_{x,y}(spath_{i,j}(T_1))\, p_{x,y} \ge \theta'_{i,j} \qquad \forall (i,j) \in E_{reg} \qquad (A.10)$$

$$\sum_{(x,y) \in spath_{i,j}(T_1)} sign_{x,y}(spath_{i,j}(T_1))\, p_{x,y} \ge \delta'_{i,j} \qquad \forall (i,j) \in E_{sched} \cup E_{reg} \qquad (A.11)$$

$$b_i \ge 0 \qquad \forall i \in V \qquad (A.12)$$

where b_i and p_{i,j} are integer variables, T_1 is a spanning tree of the dependence graph G_1, and θ'_{i,j} and δ'_{i,j} are integer constants that are linear combinations of the ω_u and θ_{i,j} integer constants in Equation (A.4) and Inequality (A.5). Note that Inequalities (A.10), (A.11), and (A.12) correspond to Inequalities (A.5), (A.6), and (A.7), respectively. Note also that the only p_{i,j} variables that appear in Inequalities (A.10) and (A.11) are the ones associated with edges of the spanning tree T_1.

Consider the dependence graph in Figure A.1a, which corresponds to Example 4.4 with its dependence graph, MRT, and formulation as shown in Figure 4.6a, Figure 4.6b, and Problem (4.13). The integer linear programming formulation of Example 4.4 is shown here in Figure A.2.

[Figure A.1: Dependence graph and its spanning tree (Example 4.4). (a) The dependence graph, with vertices v0–v7 holding operations ld0, a1, m2, m3, a4, a5, d6, and st7, edges e0–e8, and the variables p_{0,1}, p_{0,2}, p_{0,5}, p_{1,4}, p_{2,3}, p_{3,4}, p_{4,6}, p_{5,6}, p_{6,7} labeling the edges; the MRT assigns add: a1, a4, a5; mult: m2, m3; div: d6; load: ld0; store: st7. (b) The underlying graph and spanning tree, with the spanning-tree edges shown in bold.]

Minimize:
    b_0 + b_1 + b_2 + b_3 + b_4 + b_5 + b_6 + b_7
Subject to:
    from (A.4): p_{0,2} + p_{2,3} + p_{3,4} − p_{0,1} − p_{1,4} = −2   (A.13)
    from (A.4): p_{0,1} + p_{1,4} + p_{4,6} − p_{0,5} − p_{5,6} = 0    (A.14)
    from (A.5): b_0 − p_{0,1} ≥ 1;  b_0 − p_{0,2} ≥ 0;  b_0 − p_{0,5} ≥ 0
    from (A.5): b_1 − p_{1,4} ≥ 0;  b_2 − p_{2,3} ≥ 0;  b_3 − p_{3,4} ≥ 0
    from (A.5): b_4 − p_{4,6} ≥ 0;  b_5 − p_{5,6} ≥ 0;  b_6 − p_{6,7} ≥ 0
    from (A.6): p_{0,1} ≥ 0;  p_{0,2} ≥ 0;  p_{0,5} ≥ 0;  p_{1,4} ≥ 0;  p_{2,3} ≥ 0
    from (A.6): p_{3,4} ≥ 0;  p_{4,6} ≥ 0;  p_{5,6} ≥ 0;  p_{6,7} ≥ 0
    from (A.7): b_i ≥ 0  ∀ i ∈ V

Figure A.2: Formulation with underlying cycle constraints (Example 4.4).


Minimize:
    b_0 + b_1 + b_2 + b_3 + b_4 + b_5 + b_6 + b_7
Subject to:
    from (A.5) and (A.13): b_0 − p_{0,1} ≥ 1;  b_0 − (p_{0,1} + p_{1,4} − p_{3,4} − p_{2,3}) ≥ −2
    from (A.5) and (A.14): b_0 − (p_{0,1} + p_{1,4} + p_{4,6} − p_{5,6}) ≥ 0;  b_1 − p_{1,4} ≥ 0
    from (A.5): b_2 − p_{2,3} ≥ 0;  b_3 − p_{3,4} ≥ 0;  b_4 − p_{4,6} ≥ 0
    from (A.5): b_5 − p_{5,6} ≥ 0;  b_6 − p_{6,7} ≥ 0
    from (A.6) and (A.13): p_{0,1} ≥ 0;  (p_{0,1} + p_{1,4} − p_{3,4} − p_{2,3}) ≥ 2
    from (A.6) and (A.14): (p_{0,1} + p_{1,4} + p_{4,6} − p_{5,6}) ≥ 0;  p_{1,4} ≥ 0;  p_{2,3} ≥ 0
    from (A.6): p_{3,4} ≥ 0;  p_{4,6} ≥ 0;  p_{5,6} ≥ 0;  p_{6,7} ≥ 0
    from (A.7): b_i ≥ 0  ∀ i ∈ V

Figure A.3: Formulation with variables p_{0,2} and p_{0,5} eliminated; the remaining p_{i,j} variables correspond to the edges of the spanning tree of Figure A.1b.

The formulation of Figure A.2 contains two equations, Equations (A.13) and (A.14), that correspond, respectively, to the left and right underlying cycles of Figure A.1. The spanning tree that is used to guide the p_{i,j} variable substitution is shown in Figure A.1b with the bold edges. This spanning tree includes all but the edges (0,2) and (0,5), which are associated, respectively, with the left and right underlying cycles of Figure A.1b. When p_{0,1} + p_{1,4} − p_{3,4} − p_{2,3} − 2 is substituted for p_{0,2}, and p_{0,1} + p_{1,4} + p_{4,6} − p_{5,6} is substituted for p_{0,5}, we obtain the system of inequalities shown in Figure A.3. Comparing the initial and the final systems of inequalities, we can see that p_{0,2} has been removed in the final system by substituting p_{0,1} + p_{1,4} − p_{3,4} − p_{2,3} − 2, which precisely corresponds to the path in the spanning tree from vertex 0 to vertex 2, plus a constant term; we can also verify that each p_{x,y} variable in this path is multiplied by the correct +1 or −1 factor, according to the direction shown in Figure A.1a. A similar observation applies to the removal of p_{0,5}.

We now introduce two matrix definitions that are used to characterize a directed graph:

Definition A.1 The incidence matrix of a graph G = {V, E} with m vertices and n edges is the m × n matrix A, where each row corresponds to a vertex and each column corresponds to an edge of G. Matrix A is defined as:

$$a_{i,j} = \begin{cases} \;\;\,1 & \text{if } e_j = (i, x) \text{ for some } x \in V \setminus \{i\} \\ -1 & \text{if } e_j = (x, i) \text{ for some } x \in V \setminus \{i\} \\ \;\;\,0 & \text{otherwise} \end{cases}$$

where V \ {i} denotes the set of vertices V minus the vertex i, which enforces the elimination of the self-loop edges from the incidence matrix. The out-incidence matrix, Ā, is constructed similarly, with ā_{i,j} = 1 if a_{i,j} = 1, and ā_{i,j} = 0 otherwise.

Note that each column of the incidence matrix has precisely two nonzero entries; e.g. the column associated with edge (i,j) has a +1 value in the row associated with the tail vertex i, a −1 value in the row associated with the head vertex j, and 0 values elsewhere. Similarly, a column of the out-incidence matrix has precisely one nonzero entry; e.g. the column associated with edge (i,j) has a +1 value in the row associated with the tail vertex i and 0 values elsewhere.
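As a concrete aid, the short sketch below (Python with NumPy; a hypothetical helper, not part of the dissertation) builds the incidence and out-incidence matrices of Definition A.1 from an edge list, with self-loops yielding all-zero columns as required:

```python
import numpy as np

def incidence_matrices(num_vertices, edges):
    """Return (A, A_out) per Definition A.1; a self-loop edge yields an
    all-zero column, since V minus {i} excludes the vertex itself."""
    A = np.zeros((num_vertices, len(edges)), dtype=int)
    for j, (tail, head) in enumerate(edges):
        if tail != head:              # drop self-loops
            A[tail, j] = 1            # +1 at the tail vertex
            A[head, j] = -1           # -1 at the head vertex
    A_out = (A == 1).astype(int)      # keep only the +1 (tail) entries
    return A, A_out

# Edges e0..e8 of Example 4.4 (Figure A.1a), as (tail, head) pairs:
edges = [(0, 1), (1, 4), (2, 3), (3, 4), (4, 6), (5, 6), (6, 7),
         (0, 2), (0, 5)]
D, D_out = incidence_matrices(8, edges)
print(D)   # reproduces the matrix of Figure A.4
```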

Definition A.2 The network matrix of a graph G = {V, E} with m vertices and n edges is the (m−1) × n matrix A, where each row corresponds to an edge of a spanning tree of graph G and each column corresponds to an edge of G. Let e'_0 … e'_{m−2} be the edges of the spanning tree T of graph G and let e_0 … e_{n−1} be the edges of G. In the definition below, we say that "e_j = (x,y) passes e'_i" if the path in the spanning tree T between vertices x and y contains the spanning tree edge e'_i. Matrix A is defined as:

$$a_{i,j} = \begin{cases} \;\;\,1 & \text{if } e_j \text{ passes } e'_i \text{ in the forward direction} \\ -1 & \text{if } e_j \text{ passes } e'_i \text{ in the backward direction} \\ \;\;\,0 & \text{otherwise} \end{cases}$$

In general, the rows and the columns of A are arranged such that the first m−1 columns of A correspond to the edges of the spanning tree, i.e. e'_0 = e_0, …, e'_{m−2} = e_{m−2}, and thus A = [A_1 A_2] where A_1 = I (the identity matrix).
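To make the path-based definition concrete, the following sketch (Python with NumPy; hypothetical, not from the dissertation) computes a network matrix by walking the unique tree path between each edge's endpoints and recording the traversal directions:

```python
import numpy as np

def network_matrix(num_vertices, edges, tree_rows):
    """Network matrix per Definition A.2: row r corresponds to the
    spanning-tree edge edges[tree_rows[r]]; column j records how the tree
    path between the endpoints of edges[j] traverses each tree edge."""
    adj = {v: [] for v in range(num_vertices)}
    for r, t in enumerate(tree_rows):
        a, b = edges[t]
        adj[a].append((b, r, +1))   # traversing a -> b is the forward use
        adj[b].append((a, r, -1))   # traversing b -> a is the backward use

    def signed_path(x, y):          # the unique tree path x -> y, via DFS
        stack, seen = [(x, [])], {x}
        while stack:
            v, path = stack.pop()
            if v == y:
                return path
            for w, r, s in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, path + [(r, s)]))
        raise ValueError("endpoints not connected in the spanning tree")

    H = np.zeros((len(tree_rows), len(edges)), dtype=int)
    for j, (x, y) in enumerate(edges):
        for r, s in signed_path(x, y):
            H[r, j] = s
    return H

edges = [(0, 1), (1, 4), (2, 3), (3, 4), (4, 6), (5, 6), (6, 7),
         (0, 2), (0, 5)]
H = network_matrix(8, edges, tree_rows=list(range(7)))  # e0..e6 = tree
print(H)   # first seven columns form the identity; cf. Figure A.5
```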

Note that the column associated with edge (i,j) of a network matrix for graph G and spanning tree T precisely defines the sign_{x,y} function for spath_{i,j}(T), the path in the spanning tree T from vertex i to vertex j. Consider the dependence graph in Figure A.1a. The incidence matrix, D, associated with this graph is shown in Figure A.4. The legend for the rows and columns of D corresponds to, respectively, the vertex and edge labels in Figure A.1b. The network matrix, H, associated with the graph in Figure A.1a and the spanning tree shown in Figure A.1b is given in Figure A.5. Note, for example, that column e7 = (0,2) in H indicates that the path from vertex 0 to vertex 2 contains the spanning tree edges e0 and e1 traversed in the forward direction and the spanning tree edges e2 and e3 traversed in the backward direction. We may verify in Figures A.1a and A.1b that this is indeed the case.

Using the above definitions, we may formulate the system of inequalities for the minimum integral MaxLive stage scheduling problem as a function of the out-incidence matrices and network matrices. We consider here the dependence graph G = {V, E_sched, E_reg} with m vertices and n scheduling and register edges (n = |E_sched ∪ E_reg|). Let H_1 be the (m−1) × n network matrix of the graph G_1 = {V, E_sched ∪ E_reg} with spanning tree T_1. Let p be the vector of size (m−1) containing the p_{i,j} variables that correspond to the spanning tree edges associated with each row of the H_1 matrix. Similarly, let δ' be the vector of size n containing the δ'_{i,j} integer constants that correspond to the edges associated with each column of the H_1 matrix. We may formulate Inequality (A.11) as:

$$p\, H_1 \ge \delta' \qquad (A.15)$$

since a column of H_1 precisely defines the sign_{x,y} function for spath_{i,j}(T_1), i.e. the path in the spanning tree T_1 from vertex i to vertex j as used in Inequality (A.11). Let D̄_2 be the m × n out-incidence matrix of the graph G_2 = {V, E_reg}. Let b be the vector of size m containing the b_i variables that correspond to the vertices associated with each row of the D̄_2 matrix.

          e0  e1  e2  e3  e4  e5  e6  e7  e8
    v0 |   1   0   0   0   0   0   0   1   1 |
    v1 |  -1   1   0   0   0   0   0   0   0 |
    v2 |   0   0   1   0   0   0   0  -1   0 |
D = v3 |   0   0  -1   1   0   0   0   0   0 |
    v4 |   0  -1   0  -1   1   0   0   0   0 |
    v5 |   0   0   0   0   0   1   0   0  -1 |
    v6 |   0   0   0   0  -1  -1   1   0   0 |
    v7 |   0   0   0   0   0   0  -1   0   0 |

Figure A.4: Incidence matrix for dependence graph of Figure A.1 (Example 4.4).

          e0  e1  e2  e3  e4  e5  e6  e7  e8
    e0 |   1   0   0   0   0   0   0   1   1 |
    e1 |   0   1   0   0   0   0   0   1   1 |
    e2 |   0   0   1   0   0   0   0  -1   0 |
H = e3 |   0   0   0   1   0   0   0  -1   0 |
    e4 |   0   0   0   0   1   0   0   0   1 |
    e5 |   0   0   0   0   0   1   0   0  -1 |
    e6 |   0   0   0   0   0   0   1   0   0 |

Figure A.5: Network matrix for spanning tree of Figure A.1.

The system of Figure A.3, written in the matrix form of Problem (A.17), reads:

$$\begin{bmatrix} \bar{D}_2^T & -H_1^T \\ 0 & H_1^T \end{bmatrix} \begin{bmatrix} b \\ p \end{bmatrix} \ge \begin{bmatrix} \theta' \\ \delta' \end{bmatrix}$$

with variables b = (b_0, …, b_7) and p = (p_{0,1}, p_{1,4}, p_{2,3}, p_{3,4}, p_{4,6}, p_{5,6}, p_{6,7}), where D̄_2 is the out-incidence matrix obtained by keeping only the +1 entries of the matrix D of Figure A.4, H_1 is the network matrix H of Figure A.5, and the right-hand-side vectors, indexed by the edges e0, …, e8, are θ' = (1, 0, 0, 0, 0, 0, 0, −2, 0) and δ' = (0, 0, 0, 0, 0, 0, 0, 2, 0).

Figure A.6: System of inequalities corresponding to Figure A.3.

Similarly, let θ' be the vector of size n containing the θ'_{i,j} integer constants that correspond to the edges associated with each column of the D̄_2 matrix. Note that each column in D̄_2 associated with an edge (i,j) ∉ E_reg consists of 0 entries only; similarly, each element in θ' associated with an edge (i,j) ∉ E_reg is set to −1. Using the above definitions, we may formulate Inequality (A.10) as:

$$b\, \bar{D}_2 - p\, H_1 \ge \theta' \qquad (A.16)$$

For each edge in E_reg, Inequality (A.16) results in one of the inequalities generated by Inequality (A.10); otherwise, Inequality (A.16) results in an inequality that trivially holds, since the θ' value associated with an edge not in E_reg is equal to −1. There is only a minor problem with Inequality (A.16): the self-loop register edges that appear in Inequality (A.10) do not appear in Inequality (A.16), since self-loop edges are not present in incidence and network matrices. However, since a self-loop edge (i,i) ∈ E_reg only introduces the inequality b_i ≥ θ'_{i,i}, we may simply substitute b_i − θ'_{i,i} for b_i in the system, for each edge (i,i) ∈ E_reg. We can then safely ignore these self-loop register edges. Note also that this problem does not occur for Inequality (A.15), because self-loops are not considered as underlying cycles.

Inequalities (A.15) and (A.16) can be combined and transposed to result in an integer linear programming formulation of the type "min{cx : Ax ≥ b, x ∈ Z_+^n}," namely:

$$\min\left\{ [\,1\;\;0\,]\,[\,b\;\;p\,]^T \;:\; \begin{bmatrix} \bar{D}_2^T & -H_1^T \\ 0 & H_1^T \end{bmatrix} \begin{bmatrix} b \\ p \end{bmatrix} \ge \begin{bmatrix} \theta' \\ \delta' \end{bmatrix},\ [\,b\;\;p\,] \in \mathbb{Z}_+^{2m-1} \right\} \qquad (A.17)$$

where the constraint matrix is a 2n × (2m−1) matrix, the [1 0] and [b p] vectors are of dimension (2m−1), and the [θ' δ'] vector is of dimension 2n. Note that we may omit from Equation (A.17) the inequalities generated by Inequality (A.12), since the solution vector [b p] is by definition nonnegative in the integer linear programming formulation. The constraint matrix for Example 4.4 as formulated in Figure A.3 is shown in Figure A.6.
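The integrality result proved below can be checked numerically on this example. The sketch that follows (Python with SciPy; an illustration assuming scipy is available, not part of the dissertation) solves the LP-relaxation of the Figure A.3 system with an LP-solver and prints the optimum, which lands on integer values, as Section A.2 predicts for vertex solutions:

```python
import numpy as np
from scipy.optimize import linprog

# Variables: x = [b0..b7, p01, p14, p23, p34, p46, p56, p67]
nb = 8
p_names = ["p01", "p14", "p23", "p34", "p46", "p56", "p67"]

def row(rhs, bs=(), ps=()):
    """One '>=' constraint row of Figure A.3 (bs: (b-index, coeff) pairs,
    ps: (p-name, coeff) pairs)."""
    r = np.zeros(nb + len(p_names))
    for i, c in bs:
        r[i] = c
    for name, c in ps:
        r[nb + p_names.index(name)] = c
    return r, rhs

ge = [  # nonnegativity of b and p is handled by the bounds argument
    row(1,  [(0, 1)], [("p01", -1)]),
    row(-2, [(0, 1)], [("p01", -1), ("p14", -1), ("p34", 1), ("p23", 1)]),
    row(0,  [(0, 1)], [("p01", -1), ("p14", -1), ("p46", -1), ("p56", 1)]),
    row(0,  [(1, 1)], [("p14", -1)]),
    row(0,  [(2, 1)], [("p23", -1)]),
    row(0,  [(3, 1)], [("p34", -1)]),
    row(0,  [(4, 1)], [("p46", -1)]),
    row(0,  [(5, 1)], [("p56", -1)]),
    row(0,  [(6, 1)], [("p67", -1)]),
    row(2,  [], [("p01", 1), ("p14", 1), ("p34", -1), ("p23", -1)]),
    row(0,  [], [("p01", 1), ("p14", 1), ("p46", 1), ("p56", -1)]),
]
A = np.array([r for r, _ in ge])
rhs = np.array([v for _, v in ge])
c = np.concatenate([np.ones(nb), np.zeros(len(p_names))])  # min sum(b_i)
# linprog takes A_ub x <= b_ub, so negate the '>=' system:
res = linprog(c, A_ub=-A, b_ub=-rhs, bounds=(0, None), method="highs")
print(res.fun, np.round(res.x, 6))
```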

A.2 LP-Relaxation of Stage-Scheduling Problem

In this section, we demonstrate that the integer linear programming problem developed in Section A.1 can be solved using an LP-solver. In this demonstration, we use the following matrix property.

Definition A.3 An m × n integral matrix A is totally unimodular (TU) if the determinant of each square submatrix of A is equal to 0, 1, or −1.

The TU property is used in the following theorem [69, pp. 536 and 541] to guarantee that the LP-relaxation of an integer linear programming problem results in an integer solution, if any solution is found:

Theorem A.1 The value z_LP = min{cx : Ax ≥ b, x ∈ R_+^n} produced by an LP-solver is integral if the m × n matrix A is TU, vector b ∈ Z^m, and vector c ∈ Z^n, provided that the problem is feasible, i.e. a solution is found.

Since the vectors [θ' δ'] and [1 0] of Problem (A.17) have integer values, it is sufficient to show that the constraint matrix of Problem (A.17) is TU. Showing that a matrix is TU is in NP in general [69, p. 540]; however, because of the structure of Problem (A.17), we can show here that its constraint matrix is always TU. We introduce the following definition and theorems that are used in the demonstration.

Definition A.4 Let A be a matrix with entries equal to 0, 1, and −1. A submatrix of A, comprised of the set of rows I and the set of columns J, is Eulerian if the sum of its elements in each row, and in each column, is even, i.e.

$$\sum_{j \in J} a_{i,j} \equiv 0 \pmod 2 \qquad \forall i \in I$$

$$\sum_{i \in I} a_{i,j} \equiv 0 \pmod 2 \qquad \forall j \in J$$

This definition is used in the following theorem by Camion [17] to show the TU property of a matrix.

Theorem A.2 Let A be a matrix with entries equal to 0, 1, and −1. Matrix A is TU if and only if, for every square Eulerian submatrix comprised of the set of rows I and the set of columns J, the sum of the elements can be divided by 4, i.e.

$$\sum_{i \in I,\; j \in J} a_{i,j} \equiv 0 \pmod 4$$
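The determinant characterization of Definition A.3 can also be tested directly, if only on tiny matrices. The brute-force checker below (Python with NumPy; hypothetical, exponential in the matrix size, and not part of the dissertation) confirms, for instance, that the incidence matrix of a small directed graph is TU:

```python
import numpy as np
from itertools import combinations

def is_totally_unimodular(A):
    """Brute-force Definition A.3: every square submatrix must have a
    determinant of 0, +1, or -1. Exponential; for tiny matrices only."""
    m, n = A.shape
    for k in range(1, min(m, n) + 1):
        for rows in combinations(range(m), k):
            for cols in combinations(range(n), k):
                d = round(np.linalg.det(A[np.ix_(rows, cols)]))
                if d not in (-1, 0, 1):
                    return False
    return True

# Incidence matrix of a 4-vertex digraph with edges
# (0,1), (1,3), (2,3), (0,2) -- a classic TU family of matrices.
D = np.array([[ 1,  0,  0,  1],
              [-1,  1,  0,  0],
              [ 0,  0,  1, -1],
              [ 0, -1, -1,  0]])
print(is_totally_unimodular(D))   # True
```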

We demonstrate the TU property of the constraint matrix of Problem (A.17) in two steps. We first show in Theorem A.3 that a matrix related to the constraint matrix of Problem (A.17) is TU. Then we demonstrate in Theorem A.6 that we may transform the related matrix into the constraint matrix of Problem (A.17) while preserving the TU property.

Theorem A.3 Consider the dependence graph G = {V, E_sched, E_reg}. The matrix:

$$\begin{bmatrix} \bar{D}_2 & 0 \\ D_1 & -D_1 \end{bmatrix} \qquad (A.18)$$

is TU, where D_1 is the incidence matrix of graph G_1 = {V, E_sched ∪ E_reg} and D̄_2 is the out-incidence matrix of graph G_2 = {V, E_reg}. The incidence matrices are matrices of size m × n, where m is the number of vertices and n is the number of edges, i.e. n = |E_sched ∪ E_reg|.

Ning and Gao [70] have shown that the above theorem applies in the case where E_sched = E_reg. Our proof for the general case where E_sched ≠ E_reg is directly based on their proof, with the necessary adaptation to give it the broader applicability that we require here.

Proof. Let A be the matrix defined by Equation (A.18), consisting of the upper-left, upper-right, lower-left, and lower-right submatrices D̄_2, 0, D_1, and −D_1, respectively.

We demonstrate Theorem A.3 by showing that the sum of the elements of an arbitrary Eulerian submatrix of A can be divided by 4. If this result holds, A is TU by Theorem A.2. Consider an arbitrary square Eulerian submatrix of A, referred to as A', which includes the set of rows I and the set of columns J. The demonstration will account for each of the nonzero elements in A' and show that their total sum can be divided by 4.

First, we assume by hypothesis H1 that the Eulerian submatrix A' includes at least one +1 element in the upper-left submatrix of A. Consider a_{i,j} = 1, i ∈ I, and j ∈ J in the upper-left matrix, where 0 ≤ i < m and 0 ≤ j < n. Recall that the upper-left submatrix of A corresponds to the out-incidence matrix of graph G_2 = {V, E_reg} and thus contains only 0 and +1 elements. Remember also that the upper-right matrix of A is a zero matrix. Thus, all the elements in row i in the upper matrices of A are either 0 or +1. Since hypothesis H1 states that A' includes some row i and column j with a +1 entry in element a_{i,j}, A' must also include another column with a +1 element in row i in order to be Eulerian. This +1 element can only be found in the upper-left matrix; thus there must be a column j' such that a_{i,j'} = 1, j' ∈ J, and 0 ≤ j' < n. Furthermore, we know that columns j and j' are associated with two register edges, referred to as e and e', respectively. We also know that edges e and e' are both outgoing edges from the same vertex, i.e. the vertex associated with row i, which we refer to as v.

To be Eulerian, the sum of the elements in columns j and j' of A' must also be a multiple of 2. Note that there is only one nonzero entry per column in the upper-left submatrix, since it corresponds to the out-incidence matrix of graph G_2. Thus, at least one nonzero element in columns j and j' must be selected from the lower-left submatrix of A. Furthermore, the lower-left submatrix contains precisely one +1 and one −1 entry per column, since it corresponds to the incidence matrix of the graph G_1 = {V, E_sched ∪ E_reg}.

Let us further assume by hypothesis H2 that A' selects the +1 entry in column j from the lower-left submatrix of A, i.e. there is a row i' such that a_{i',j} = 1, i' ∈ I, and m ≤ i' < 2m. Since we know by hypothesis H1 that a_{i,j} = 1 (i.e. the edge e associated with column j is an outgoing edge from the vertex v associated with row i in graph G_2) and by hypothesis H2 that a_{i',j} = 1 (i.e. the same edge e is also an outgoing edge from the vertex associated with row i' in graph G_1), rows i and i' must be associated with the same vertex v. In turn, this implies that the element at the intersection of the selected row i' and column j' is positive (i.e. a_{i',j'} = 1), because we know that the edge e' associated with the selected column j' is also an outgoing edge from vertex v. Thus, hypotheses H1 and H2 imply four +1 entries, namely a_{i,j}, a_{i',j}, a_{i,j'}, and a_{i',j'}. Since these four entries add 4 to the total sum of the elements in A', the contribution of these elements is clearly a multiple of 4.

If, instead of hypothesis H2, we assume by hypothesis H3 that A' selects the −1 entry in column j from the lower-left submatrix, i.e. there is a row i'' such that i'' ≠ i', a_{i'',j} = −1, i'' ∈ I, and m ≤ i'' < 2m. Note that since hypotheses H1, H2, and H3 cannot all be true,¹ assuming H1 and H3 implies that H2 is false. Because a +1 and a −1 entry are the only selected elements in column j, the total contribution of the entries a_{i,j} and a_{i'',j} is 0, which clearly adds a contribution to the total sum of the elements in A' that is a multiple of 4. Similarly, we know by hypothesis that H2 is false, i.e. i' ∉ I. As a result, we know that A' must also select its −1 entry in column j' from the lower-left submatrix, say in row i'''. Thus, the total contribution of the entries a_{i,j'} and a_{i''',j'} also adds up to 0, again a contribution that is a multiple of 4. We have now proved that any nonzero entries that can be selected from the upper-left matrix result in a total contribution that is a multiple of 4, since we have analyzed the only two possible (and disjoint) cases with hypothesis H1, i.e. "H1 and H2 and not H3" and "H1 and not H2 and H3."

Let us now discard hypotheses H1, H2, and H3 and assume by hypothesis H4 that the Eulerian submatrix A' selects an arbitrary nonzero element that is not yet accounted for from the lower submatrices of A, i.e. a new element in row i and column j such that a_{i,j} ≠ 0, i ∈ I, j ∈ J, m ≤ i < 2m, and 0 ≤ j < 2n. Since we cannot select another nonzero entry from the upper-left submatrix (a case already covered in hypotheses H1, H2, and H3) or from the upper-right submatrix (which is a zero submatrix), another nonzero element in column j must be selected from the lower submatrices of A if A' is to be Eulerian. Because the lower matrices of A are incidence matrices, there are only 2 nonzero entries per column, one with the value +1 and one with the value −1. Thus, we must select the other nonzero entry, i.e. there is a row i' such that a_{i',j} = −a_{i,j}, i' ∈ I, and m ≤ i' < 2m. As a result, the contribution of the two entries a_{i,j} and a_{i',j} also sums up to 0.

To conclude, hypotheses H1 to H4 account for each nonzero entry in an arbitrary Eulerian submatrix A'. We have shown that the contribution of these nonzero entries can always be grouped such that the sum for each of these disjoint groups of nonzero entries adds up to either 0 or 4. Thus, the total sum of the entries in A' can be divided by 4, and A is TU by Theorem A.2. □

¹ If we assume H1, H2, and H3 to all be true, they contribute, respectively, the +1, +1, and −1 entries in column j, which do not sum up to a multiple of 2. Since these 3 entries are the only nonzero entries in column j, A' cannot be Eulerian in column j. Thus assuming H1, H2, and H3 results in a contradiction.

In the second step, we show that we may transform Matrix (A.18) into the constraint matrix of Problem (A.17) by using transformation steps that preserve the TU property of Matrix (A.18). The TU-preserving transformations that we use are found in [69, p. 540] and described here.

Theorem A.4 The following statements are equivalent: (1) A is TU; (2) the transpose of A is TU; (3) a matrix obtained by deleting a row or a column of A, or by interchanging rows or columns of A, is TU; (4) a matrix obtained by multiplying a row or a column of A by −1 is TU; (5) a matrix obtained by a pivot operation on A is TU.

We also use a theorem that indicates how network matrices relate to incidence matrices [69, pp. 546-547].

Theorem A.5 A network matrix is precisely a matrix whose columns represent the incidence matrix of a graph after one row has been deleted and a number of pivots have been executed until the resulting matrix is of the form [I A].

Using Theorems A.4 and A.5, we may prove the following theorem.

Theorem A.6 The constraint matrix of Problem (A.17) is TU.

Proof. We already know from Theorem A.3 that Matrix (A.18) is TU. In this proof, we simply show how this matrix can be transformed into the constraint matrix of Problem (A.17) using exclusively transforms that preserve the TU property of the matrices. The first step is to transform the incidence matrix of graph G_1 into the network matrix of graph G_1, i.e. D_1 into H_1. Using Theorem A.5, we know that D_1 can be transformed into H_1 by using only TU-preserving transformation steps, as defined in Theorem A.4. This implies, in turn, that the matrix [D_1 −D_1] can also be transformed into the matrix [H_1 −H_1] by using exactly the same TU-preserving transformation steps. As a result, the TU Matrix (A.18) can be transformed into the following matrix:

$$\begin{bmatrix} \bar{D}_2 & 0 \\ H_1 & -H_1 \end{bmatrix} \qquad (A.19)$$

using only TU-preserving transforms. Multiplying the rows of the submatrix [H_1 −H_1] of Matrix (A.19) by −1 and taking its transpose results in the following matrix:

$$\begin{bmatrix} \bar{D}_2^T & -H_1^T \\ 0 & H_1^T \end{bmatrix} \qquad (A.20)$$

which is TU since both operations preserve the TU property. Matrix (A.20) corresponds to the constraint matrix of Problem (A.17), and thus Theorem A.6 is proved. □

As a result, Theorem A.1 now applies, and the solution of Problem (A.17) is therefore integral whenever an LP-solver finds an optimal solution.


REFERENCES

[1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, MA, 1974.
[2] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, Reading, MA, 1988.
[3] A. Aiken, A. Nicolau, and S. Novak. Resource-constrained software pipelining. IEEE Transactions on Parallel and Distributed Systems, 6(12):1248–1269, December 1995.
[4] J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of control dependence to data dependence. In Conference Record of the 10th Annual ACM Symposium on Principles of Programming Languages, pages 177–189, January 1983.
[5] V. H. Allen, U. R. Shah, and K. M. Reddy. Petri net versus modulo scheduling for software pipelining. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 105–110, November 1995.
[6] E. R. Altman. Optimal Software Pipelining with Function Unit and Register Constraints. PhD thesis, Department of Electrical Engineering, McGill University, Montreal, Canada, October 1995.
[7] E. R. Altman, R. Govindarajan, and G. R. Gao. Scheduling and mapping: Software pipelining in the presence of structural hazards. In Proceedings of the ACM SIGPLAN'95 Conference on Programming Language Design and Implementation, pages 139–150, 1995.
[8] V. Bala. Personal communication. February 1996.
[9] V. Bala and N. Rubin. Efficient instruction scheduling using finite state automata. Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 46–56, November 1995.
[10] R. A. Ballance, A. B. Maccabe, and K. J. Ottenstein. The program dependence web: A representation supporting control-, data-, and demand-driven interpretation of imperative languages. Proceedings of the ACM SIGPLAN'90 Conference on Programming Language Design and Implementation, pages 257–271, 1990.

[11] G. R. Beck, D. W. L. Yen, and T. L. Anderson. The Cydra 5 mini-supercomputer: Architecture and implementation. In The Journal of Supercomputing, volume 7, pages 143–180, 1993.
[12] D. Bernstein and M. Rodeh. Global instruction scheduling for superscalar machines. In Proceedings of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation, pages 241–255, June 1991.
[13] M. Berry et al. The Perfect Club Benchmarks: Effective performance evaluation of supercomputers. The International Journal of Supercomputer Applications, 3(3):5–40, Fall 1989.
[14] D. A. Berson, R. Gupta, and M. L. Soffa. URSA: A unified resource allocator for registers and functional units in VLIW architectures. In IFIP Working Conference on Architecture and Compilation Techniques for Fine and Medium Grain Parallelism, pages 241–255, January 1993.
[15] Preston Briggs, Keith D. Cooper, Ken Kennedy, and L. Torczon. Coloring heuristics for register allocation. Proceedings of the ACM SIGPLAN'89 Conference on Programming Language Design and Implementation, 24(7):275–284, June 1989.
[16] R. A. Brualdi. Introductory Combinatorics. New York: North-Holland, 1992.
[17] P. Camion. Characterization of totally unimodular matrices. In Proc. Amer. Math. Soc., volume 16, pages 1068–1073, 1965.
[18] G. J. Chaitin. Register allocation and spilling via graph coloring. Proceedings of the ACM SIGPLAN'82 Symposium on Compiler Construction, pages 98–105, June 1982.
[19] P. P. Chang, S. A. Mahlke, W. Y. Chen, N. J. Warter, and W. W. Hwu. IMPACT: An architectural framework for multiple-instruction-issue processors. In Proceedings of the Eighteenth Annual International Symposium on Computer Architecture, pages 266–275, May 1991.
[20] P. P. Chang, N. J. Warter, S. A. Mahlke, W. Y. Chen, and W. W. Hwu. Three architectural models for compiler-controlled speculative execution. IEEE Transactions on Computers, 44(4):481–494, April 1995.
[21] L. Chao, A. LaPaugh, and E. H. Sha. Rotation scheduling: A loop pipelining algorithm. 30th Design Automation Conference, pages 556–572, 1993.
[22] A. E. Charlesworth. An approach to scientific array processing: The architectural design of the AP-120B/FPS-164 family. Computer, 14(9):18–27, September 1981.
[23] S. Chaudhuri, R. A. Walker, and J. E. Mitchell. Analyzing and exploiting the structure of the constraints in the ILP approach to the scheduling problem. IEEE Transactions on Very Large Scale Integration Systems, 2(4):456–471, December 1994.

[24] R. G. Cytron. Compile-Time Scheduling and Optimization for Asynchronous Machines. PhD thesis, Department of Electrical and Computer Engineering, University of Illinois, Urbana, IL, 1984.
[25] E. S. Davidson, L. E. Shar, A. T. Thomas, and J. H. Patel. Effective control for pipelined computers. Spring COMPCON-75 digest of papers, pages 181–184, February 1975.
[26] J. C. Dehnert, P. Y.-T. Hsu, and J. P. Bratt. Overlapped loop support in the Cydra 5. 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, pages 181–227, May 1989.
[27] J. C. Dehnert and R. A. Towle. Compiling for the Cydra 5. In The Journal of Supercomputing, volume 7, pages 181–227, 1993.
[28] Digital Equipment Corp., Maynard, MA. DecChip 21064 Microprocessor Hardware Reference Manual EC-N0079-72.
[29] B. Dupont de Dinechin. Simplex scheduling: More than lifetime-sensitive instruction scheduling. Proceedings of the International Conference on Parallel Architecture and Compiler Techniques, pages 327–330, 1994.
[30] K. Ebcioglu, R. D. Groves, K.-C. Kim, G. M. Silberman, and I. Ziv. VLIW compilation techniques in a superscalar environment. In Proceedings of the ACM SIGPLAN'94 Conference on Programming Language Design and Implementation, pages 36–48, 1994.
[31] A. E. Eichenberger and E. S. Davidson. A reduced multipipeline machine description that preserves scheduling constraints. Technical Report CSE-TR-266-95, University of Michigan, Ann Arbor, MI, 1995.
[32] A. E. Eichenberger and E. S. Davidson. Register allocation for predicated code. Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 338–349, November 1995.
[33] A. E. Eichenberger and E. S. Davidson. Stage scheduling: A technique to reduce the register requirements of a modulo schedule. Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 180–191, November 1995.
[34] A. E. Eichenberger and E. S. Davidson. Efficient formulation for optimal modulo schedulers. Proceedings of the ACM SIGPLAN'97 Conference on Programming Language Design and Implementation, June 1997.
[35] A. E. Eichenberger, E. S. Davidson, and S. G. Abraham. Minimum register requirements for a modulo schedule. Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 75–84, November 1994.

261

[36] A. E. Eichenberger, E. S. Davidson, and S. G. Abraham. Optimum modulo schedules for minimum register requirements. Proceedings of the International Conference on Supercomputing, pages 31{40, July 1995. [37] A. E. Eichenberger, E. S. Davidson, and S. G. Abraham. Minimizing register requirements of a modulo schedule via optimum stage scheduling. In International Journal of Parallel Programming, volume 24, pages 103{132, 1996. [38] C. Eisenbeis, W. Jalby, and A. Lichnewsky. Squeezing more performance out of a Cray-2 by vector block scheduling. Proceedings of Supercomputing '88, pages 237{246, November 1988. [39] C. Eisenbeis, S. Lelait, and B. Marmol. The meeting graph: A new model for loop cyclic register allocation. Proceedings of the International Conference on Parallel Architecture and Compiler Techniques, June 1995. [40] C. Eisenbeis and D. Windheiser. Optimal software pipelining in presence of resource constraints. Proceedings of the International Conference on Parallel Architecture and Compiler Techniques, August 1993. [41] J. R. Ellis. Bulldog: a Compiler for VLIW Architectures. The MIT Press, 1985. [42] J. A. Fisher. Trace scheduling: a technique for global microcode compaction. IEEE Transactions on Computers, 30(7):478{490, July 1981. [43] J. R. Goodman and W.-C. Hsu. Code scheduling and register allocation in large basic blocks. Proceedings of the International Conference on Supercomputing, pages 442{452, 1988. [44] R. Govindarajan, E. R. Altman, and G. R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 85{94, November 1994. [45] J. C. Gyllenhaal. A machine description language for compilation. Master's thesis, Department of Electrical and Computer Engineering, University of Illinois, Urbana, IL, 1994. [46] L. J. Hendren, G. R. Gao, E. R. Altman, and C. Mukerji. A register allocation framework based on hierarchical cyclic interval graphs. Proceedings of the 4th International Conference on Compiler Construction. Lecture Notes in Computer Science 641, Springer, pages 176{191, 1992. [47] F. S. Hillier and G. J. Lieberman. Introduction to Mathematical Programming. McGraw-Hill, 1990. [48] P. Y. Hsu. Highly Concurrent Scalar Processing. PhD thesis, Department of Electrical and Computer Engineering, University of Illinois, Urbana, IL, 1986. 262

[49] R. A. Hu . Lifetime-sensitive modulo scheduling. Proceedings of the ACM SIGPLAN'93 Conference on Programming Language Design and Implementation, pages 258{267, June 1993. [50] W. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, J. Warter N, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery. The superblock: An e ective technique for VLIW and superscalar compilation. In The Journal of Supercomputing, volume 7, pages 229{248, 1993. [51] R. C. Johnson. Personal communication. June 1995. [52] K. Ebcioglu and T. Nakatani. A New Compilation Technique for Parallelizing Loops with Unpredictable Branches on a VLIW Architecture. In D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers for Parallel Computing, pages 213{229. Pitman/The MIT Press, 1990. [53] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice Hall, 1992. [54] V. Kathail, M. S. Schlansker, and B. R. Rau. HPL PlayDoh architecture speci cation: Version 1.0. Technical Report HPL-93-80, HP Laboratories, February 1994. [55] M. Lam. Software Pipelining: An e ective scheduling technique for VLIW machines. Proceedings of the ACM SIGPLAN'88 Conference on Programming Language Design and Implementation, pages 318{328, June 1988. [56] D. M. Lavery and W. W. Hwu. Unrolling-based optimizations for modulo scheduling. Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 327{337, November 1995. [57] J. Llosa, A. Gonzalez, E. Ayguade, and M. Valero. Swing modulo scheduling: A lifetime-sensitive approach. Proceedings of the International Conference on Parallel Architecture and Compiler Techniques, October 1996. [58] J. Llosa, M. Valero, E. Ayguade, and A. Gonzalez. Hypernode reduction modulo scheduling. Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 350{360, November 1995. [59] G. P. Lowney, S. M. Freudenberger, T. J. Karzes, W. D. Lichtenstein, R. P. Nix, J. S. O'Donnell, and J. C. Ruttenberg. The Multi ow trace scheduling compiler. In The Journal of Supercomputing, volume 7, pages 51{142, 1993. [60] S. A. Mahlke. Exploiting Instruction Level Parallelism in the Presence of Conditional Branches. PhD thesis, Department of Electrical and Computer Engineering, University of Illinois, Urbana, IL, 1996. [61] S. A. Mahlke, W. Y. Chen, W.-M. W. Hwu, B. R. Rau, and M. S. Schlansker. Sentinel scheduling for VLIW and superscalar processors. In 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 238{247, October 1992. 263

[62] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann. E ective compiler support for predicated execution using hyperblocks. Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 45{54, December 1992. [63] W. Mangione-Smith. Performance Bounds and Bu er Space Requirements. PhD thesis, University of Michigan, Department of Electrical Engineering and Computer Science, Ann Arbor, MI, 1992. [64] W. Mangione-Smith, S. G. Abraham, and E. S. Davidson. Register requirements of pipelined processors. Proceedings of the International Conference on Supercomputing, pages 260{271, July 1992. [65] F. H. McMahon. The Livermore Fortran Kernels: A computer test of the numerical performance range. Technical Report UCRL-53745, Lawrence Livermore National Laboratory, Livermore, California, 1986. [66] W. M. Meleis and E. S. Davidson. Optimal local allocation for a multiple-issue machine. Proceedings of the International Conference on Supercomputing, pages 107{116, 1994. [67] S.-M. Moon and K. Ebcioglu. An ecient resource-constrained global scheduling technique for superscalar and VLIW processors. Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 55{71, September 1992. [68] T. Muller. Employing nite automata for resource scheduling. Proceedings of the 26th Annual International Symposium on Microarchitecture, pages 12{20, 1993. [69] G. L. Nemhauser and L. A. Wolsey. Integer and Combinatorial Optimization. Wiley, New York, 1988. [70] Q. Ning and G. R. Gao. A novel framework of register allocation for software pipelining. Twentieth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 29{42, 1993. [71] S. Olariu. Optimal greedy heuristic to color interval graphs. Information Processing letters, 37(1):21{25, 1991. [72] J. C. H. Park and M. S. Schlansker. On predicated execution. Technical Report HPL-91-58, HP Laboratories, May 1991. [73] J. H. Patel and E. S. Davidson. Improving the throughput of a pipeline by insertion of delays. Proceedings of the Third Annual International Symposium on Computer Architecture, pages 159{164, 1976. [74] K. Paton. An algorithm for nding a fundamental set of cycles of a graph. Communications of the ACM, 12(9):514{518, September 1969. 264

[75] S. S. Pinter. Register allocation with instruction scheduling: a new approach. Proceedings of the ACM SIGPLAN'93 Conference on Programming Language Design and Implementation, pages 248{257, June 1993. [76] T. A. Proebsting and C. W. Fraser. Detecting pipeline structural hazards quickly. Twenty-First Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 280{286, January 1994. [77] B. R. Rau. Iterative Modulo Scheduling: An algorithm for software pipelining loops. Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 63{74, November 1994. [78] B. R. Rau. Iterative Modulo Scheduling. International Journal of Parallel Programming, 24(1):2{64, 1996. [79] B. R. Rau and J. A. Fisher. Instruction-level parallel processing: History, overview, and perspective. In The Journal of Supercomputing, volume 7, pages 9{50, 1993. [80] B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scienti c computing. Fourteenth Annual Workshop on Microprogramming, pages 183{198, October 1981. [81] B. R. Rau, C. D. Glaeser, and R. L. Picard. Ecient code generation for horizontal architectures: Compiler techniques and architecture support. Proceedings of the Ninth Annual International Symposium on Computer Architecture, pages 131{139, 1982. [82] B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker. Register allocation for software pipelined loops. Proceedings of the ACM SIGPLAN'92 Conference on Programming Language Design and Implementation, pages 283{299, June 1992. [83] J. R. Ruttenberg, G. R. Gao, and A. Stoutchinin. Software pipelining showdown: Optimal vs. heuristic methods in a production compiler. Proceedings of the ACM SIGPLAN'96 Conference on Programming Language Design and Implementation, pages 1{11, May 1996. [84] M. S. Schlansker. Personal communication. June 1995. [85] M. S. Schlansker and V. Kathail. Acceleration of rst and higher order recurrences on processors with instruction level parallelism. Proceedings of the Languages and Compilers for Parallel Computing, 6th International Workshop, August 1993. [86] M. S. Schlansker and V. Kathail. Critical path reduction for scalar code. Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 57{69, November 1995. 265

[87] M. S. Schlansker, V. Kathail, and S. Anik. Height reduction of control recurrences for ILP processors. Proceedings of the 27th Annual International Symposium on Microarchitecture, pages 40{51, November 1994. [88] E. Su, A. Lain, S. Ramaswamy, D. J. Palermo, E. W. Hodges IV, and P. Banerjee. Advanced compilation techniques in the PARADIGM compiler for distributedmemory multicomputers. Proceedings of the International Conference on Supercomputing, pages 424{433, July 1995. [89] P. P. Tirumalai, M. Lee, and M. S. Schlansker. Parallelization of loops with exits on pipelined architectures. Proceedings of Supercomputing '90, pages 200{212, November 1990. [90] P. Tu and D Padua. Gated SSA-based demand-driven symbolic analysis for parallelizing compilers. Proceedings of the International Conference on Supercomputing, pages 414{423, July 1995. [91] J. Uniejewski. SPEC Benchmark Suite: Designed for today's advanced system. SPEC Newsletter, Fall 1989. [92] J. Wang, C. Eisenbeis, M. Jourdan, and B. Su. Decomposed software pipelining: A new perspective and a new approach. In International Journal of Parallel Programming, volume 22, pages 357{379, 1994. [93] J. Wang, A. Krall, and M.A. Ertl. Decomposed software pipelining with reduced register requirement. In Proceedings of the International Conference on Parallel Architecture and Compiler Techniques, June 1995. [94] N. J. Warter. Modulo Scheduling with Isomorphic Control Transformations. PhD thesis, Department of Electrical and Computer Engineering, University of Illinois, Urbana, IL, 1994. [95] N. J. Warter, G. E. Haab, K. Subramanian, and J. W. Bockhaus. Enhanced Modulo Scheduling for loops with conditional branches. Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 170{179, December 1992. [96] N. J. Warter-Perez and N Partamian. Modulo scheduling with multiple initiation intervals. Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 111{118, November 1995.

