Software Pipelining of Loops by the Method of Modulo Scheduling

ISSN 0361-7688, Programming and Computer Software, 2007, Vol. 33, No. 6, pp. 307–315. © Pleiades Publishing, Ltd., 2007. Original Russian Text © N.I. V’yukova, V.A. Galatenko, S.V. Samborskii, 2007, published in Programmirovanie, 2007, Vol. 33, No. 6.

N. I. V’yukova, V. A. Galatenko, and S. V. Samborskii
Scientific Research Institute of System Studies, Russian Academy of Sciences, Nakhimovskii pr. 36/1, Moscow, 117218 Russia
e-mail: [email protected]
Received October 13, 2006

Abstract—Software pipelining is an efficient loop optimization method that exploits parallelism among operations belonging to different loop iterations. Currently, most commercial compilers use loop pipelining methods based on modulo scheduling algorithms. This paper reviews these algorithms and considers algorithmic solutions for overcoming the negative effects of software pipelining (a significant growth in code size and an increase in register pressure), as well as methods that exploit hardware features of a target architecture. The paper also considers global scheduling mechanisms that allow one to pipeline loops containing several basic blocks and loops in which the number of iterations is not known before execution.

DOI: 10.1134/S0361768807060023

1. INTRODUCTION

Investigations in the field of compiling for modern superscalar and VLIW (Very Large Instruction Word) architectures are mainly focused on the problems of identifying and efficiently using instruction-level parallelism, which is characteristic of microprocessors of these types (see, for example, [1]). Within a single basic block, the available parallelism is normally insufficient; therefore, acyclic code regions require methods of global scheduling (trace scheduling [2, 3], superblock scheduling [4], etc.), in which some instructions are moved from their basic blocks to adjacent blocks so that several basic blocks are executed concurrently. In a loop, we deal with sequences of basic blocks corresponding to different loop iterations. The idea of software pipelining of loops is to build a schedule in which successive iterations of the original loop are initiated at some constant interval and to minimize this interval by overlapping different iterations of the original loop. To ensure that the original and pipelined loops are equivalent, the pipelined loop is supplemented with a prolog and an epilog containing instructions from the initial and final iterations.

Example 1. Let us consider the following loop:

for (i = 0; i < N; i++) z[i] = 2*x[i] + y[i];

Figure 1a shows the data dependence graph of this loop. Figure 1b demonstrates the schedule of a single iteration on a processor that can issue up to three instructions per clock cycle, executes the multiplication operation in 4 clock cycles, and executes the remaining operations in 1 clock cycle. In this case, the execution of a single iteration takes 7 clock cycles, and, because of the data dependences, the parallelism of the processor remains almost unused. Figure 1c gives an example of pipelining for this loop (taken from [5], p. 60). In this example, the body (kernel) of the pipelined loop combines instructions from four neighboring iterations; owing to this, one iteration of the pipelined loop takes just 2 clock cycles (instead of 7 for the original loop). Thus, loop pipelining may lead to a considerable growth in performance, allowing the processor resources to be fully used in some cases. The equivalence of computations to the original loop is ensured by adding a prolog (containing the “missing” instructions of the initial iterations) before the body of the pipelined loop and an epilog (containing the remaining instructions of the final iterations) after the body; the number of kernel repetitions decreases accordingly by a value equal to the depth of iteration overlap (in this example, by 3).

There are two main classes of pipelining algorithms. The algorithms of the move-then-schedule type (see [6]), also called code motion algorithms, build up the kernel of the pipelined loop by successively moving instructions cyclically upward, with subsequent scheduling and selection of the best variant. The algorithms of the second class do not construct a loop schedule by successive transformations of the original loop but build the loop kernel “from scratch” out of the instructions of the original loop. Within this class, one distinguishes algorithms based on loop unrolling and packing (see [7, 8]) and modulo scheduling algorithms. At present, commercial compilers most often use the method of modulo scheduling because there are efficient algorithms that find optimal or nearly optimal solutions of the problem for modern processor architectures [9–12].
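
To make the structure of the prolog, kernel, and epilog more concrete, the following C sketch (ours) reproduces the pipelined loop of Example 1 at the source level; it mirrors the overlap structure of Fig. 1c rather than the exact machine schedule, and it assumes N >= 3.

```c
/* A source-level sketch of the pipelined loop of Example 1.
 * m0, m1, m2 hold the products of three iterations "in flight";
 * the element type and the function name are our own choices. */
void pipelined(int N, const int *x, const int *y, int *z)
{
    int m0, m1, m2;

    /* Prolog: start the first three iterations without storing anything. */
    m0 = 2 * x[0];
    m1 = 2 * x[1];
    m2 = 2 * x[2];

    /* Kernel: each pass finishes iteration i and starts iteration i + 3. */
    for (int i = 0; i + 3 < N; i++) {
        z[i] = m0 + y[i];     /* add and store for iteration i            */
        m0 = m1;              /* products of iterations i+1, i+2 move up  */
        m1 = m2;
        m2 = 2 * x[i + 3];    /* load and multiply for iteration i + 3    */
    }

    /* Epilog: drain the last three iterations already in flight. */
    for (int i = N - 3; i < N; i++) {
        z[i] = m0 + y[i];
        m0 = m1;
        m1 = m2;
    }
}
```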


Fig. 1. Pipelining of the loop in Example 1: (a) data dependence graph; (b) schedule of an iteration of the initial loop; (c) schedule of the pipelined loop (prolog, kernel, and epilog).

This method can be applied to a wide class of loops, including loops with conditional constructs, loops whose iteration count is unknown before execution, and loops with multiple exits. Software pipelining also has negative effects, such as growth in code size and an increase in register pressure. The code size grows because of the prologs and epilogs and for a number of other reasons. The register pressure increases because, in the pipelined loop, the lifetimes of variables belonging to different iterations normally overlap, which correspondingly increases the number of simultaneously “open” lifetimes. To mitigate these effects, some general-purpose microprocessor architectures provide special hardware features, such as rotation of the register file and predicated execution. However, the microprocessor architectures designed for embedded systems and digital signal processing normally have no such features, because implementing them considerably complicates the architecture, increases the chip area, and causes other problems [11]. Therefore, it is important to conduct research and algorithmic investigations aimed at reducing the code size and the register requirements of loop pipelining. These problems are considered in Sections 2.3 and 2.4.

In this review, we consider various algorithms of software pipelining by the method of modulo scheduling, methods for reducing the code size and the register pressure, and methods of global modulo scheduling that are used to pipeline complex loops, including those with conditional constructs and multiple exits.


Fig. 2. Dependence graph for the loop in Example 2: nodes t1 = B[i − 2], t2 = t1 + C, B[i] = t2, and i = i + 1; each arc is labeled with a pair (delay, distance).

2. THE METHOD OF MODULO SCHEDULING

2.1. Basic Concepts Related to Loop Pipelining

A pipelined loop is executed in three phases: the speedup phase (prolog), when the pipeline starts to fill up; the steady phase, when the pipeline filling is at its highest level; and the termination phase (epilog), when the pipeline is drained. During the steady phase, the same sequence of instructions, called the loop kernel, is executed cyclically. Software pipelining aims at creating a schedule of the loop execution in which successive iterations of the original loop are initiated at a constant interval, called the initiation interval and normally denoted II. This interval corresponds to a single execution of the kernel of the pipelined loop. In our example, II = 2. The objective of the software pipelining problem is to minimize II. A key characteristic of the quality of loop optimization is the throughput, defined as N/T, where N is the number of iterations and T is the time (in clock cycles) of execution of the whole loop. For a large number of iterations, the throughput is approximately equal to 1/II.

Lower bounds on the initiation interval. The lower bound on II is determined by the processor resources and by the constraints imposed by dependences between loop instructions. The resource-constrained lower bound on II (normally denoted ResMII) is determined by the quantity of resources (computational units of the microprocessor: adders, multipliers, etc.) available in the processor and by the resources needed to execute a single loop iteration. In the general case, computing the exact resource-constrained bound on II is an NP-complete problem; therefore, an estimate is usually used in practice, for example, max(Li/Ri) over all i, where Ri is the total number of units of the i-th type of resource and Li is the total demand for resources of the i-th type by all instructions of a single loop iteration [13]. The recurrence-constrained lower bound on II (usually denoted RecMII) is nonzero for loops with loop-carried dependences (i.e., dependences between instructions in different iterations of the loop).
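
As a minimal illustration of the ResMII estimate (a sketch of ours; the resource kinds and unit counts of the machine are hypothetical), one can tally the resource usage of a single iteration and divide by the number of available units of each kind:

```c
#include <stdio.h>

/* Sketch of ResMII = max_i(ceil(L_i / R_i)):
 * L_i = demand for resource i by one loop iteration,
 * R_i = number of available units of resource i. */
enum { RES_ALU, RES_MUL, RES_MEM, NUM_RES };

static int res_mii(const int use[NUM_RES], const int avail[NUM_RES])
{
    int mii = 1;
    for (int i = 0; i < NUM_RES; i++) {
        int bound = (use[i] + avail[i] - 1) / avail[i];  /* ceil(L_i / R_i) */
        if (bound > mii)
            mii = bound;
    }
    return mii;
}

int main(void)
{
    /* One iteration of Example 1: one add, one multiply, two loads and a store. */
    int use[NUM_RES]   = { 1, 1, 3 };
    int avail[NUM_RES] = { 1, 1, 2 };   /* hypothetical machine */
    printf("ResMII = %d\n", res_mii(use, avail));   /* prints 2 */
    return 0;
}
```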


Example 2. Let us consider the following loop with a recurrent dependence:

for (i = 0; i < N; i++) B[i] = B[i-2] + C,

where B is an array of floating-point variables. Let floating-point addition have a latency of 4 clock cycles and let the remaining operations execute in 1 clock cycle. Figure 2 presents the data dependence graph for the instructions implementing the body of this loop. The arcs are labeled with pairs of the form (d, p), where d is the pipeline delay between the corresponding instructions and p is the dependence distance (i.e., the number of iterations between the instructions connected by the loop-carried dependence). For dependences between instructions of the same iteration, p = 0. As can be seen, the dependence graph contains a cycle with a back arc describing the recurrent dependence with a distance of 2: this arc indicates that the instruction t1 = B[i − 2] of the kth iteration must be executed at least one clock cycle later than the instruction B[i] = t2 of iteration k − 2. In the general case, if a loop contains recurrent dependences, a lower bound on II is RecMII = max_c(d(c)/p(c)), where c ranges over the cycles in the dependence graph, d(c) is the sum of the delays along cycle c, and p(c) is the sum of the dependence distances along cycle c (see, for example, [13], p. 321). Note that a general limitation on the applicability of any loop pipelining method is that the dependence distances must be constant and known at compile time. If this is not the case, an unknown dependence distance can be assumed to be equal to 1 (as in the implementation described in [10]).

Estimate of the number of required registers. The number of physical registers required for the kernel of the pipelined loop is normally estimated by the value denoted MaxLive, which is equal to the maximum number of open register lifetimes over all clock cycles of the kernel execution. The lifetime of a register variable starts at the clock cycle at which the variable is assigned to the register and ends at the clock cycle at which it is last used by some instruction. Figure 3 shows the lifetimes of the variables for the example given in Fig. 1. It can be seen that the lifetimes of the variables vr1 and vr2 exceed II = 2. For these variables, it is necessary to provide more than one physical register (see Section 2.4).
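
Returning to the recurrence-constrained bound, the following sketch (ours, not taken from the cited papers; the example graph at the end is hypothetical) shows one standard way to compute RecMII without enumerating cycles: a candidate II satisfies all recurrences iff the dependence graph with arc weights d − II·p has no positive-weight cycle, which can be detected by a Floyd–Warshall-style longest-path computation.

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumes every cycle has a positive total distance p(c), as in any legal loop. */
#define MAXN 16
#define NONE (-1000000)      /* "no path" marker */

struct arc { int from, to, delay, dist; };

static bool ii_feasible(int n, const struct arc *arcs, int narcs, int ii)
{
    int w[MAXN][MAXN];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            w[i][j] = NONE;
    for (int a = 0; a < narcs; a++) {
        int v = arcs[a].delay - ii * arcs[a].dist;
        if (v > w[arcs[a].from][arcs[a].to])
            w[arcs[a].from][arcs[a].to] = v;
    }
    for (int k = 0; k < n; k++)              /* longest paths over all pairs */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (w[i][k] > NONE && w[k][j] > NONE && w[i][k] + w[k][j] > w[i][j])
                    w[i][j] = w[i][k] + w[k][j];
    for (int i = 0; i < n; i++)
        if (w[i][i] > 0)                     /* some recurrence is still violated */
            return false;
    return true;
}

static int rec_mii(int n, const struct arc *arcs, int narcs)
{
    int ii = 1;
    while (!ii_feasible(n, arcs, narcs, ii))
        ii++;
    return ii;
}

int main(void)
{
    /* Hypothetical two-instruction recurrence: delay 3 within an iteration,
     * plus a loop-carried arc with delay 1 and distance 1, so RecMII = 4. */
    struct arc g[] = { {0, 1, 3, 0}, {1, 0, 1, 1} };
    printf("RecMII = %d\n", rec_mii(2, g, 2));
    return 0;
}
```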


Fig. 3. Lifetimes of virtual registers for the loop in Example 1 (lifetimes of the variables vr0–vr3 over the kernel cycles and the resulting register requirements).

The value of MaxLive for this example is equal to 8. It was proved in [5] that, for any loop, there exists a register allocation requiring no more than MaxLive registers.

2.2. Modulo Scheduling Algorithms

The method of modulo scheduling produces a schedule of the loop iterations (taking into account the dependences between instructions) such that the iterations run at a constant interval II and have no resource conflicts [9–11]. There are approaches based on a search for an optimal schedule over the whole space of schedules, as well as heuristic scheduling algorithms. The search for an optimal schedule can be performed by enumerating all possible schedules [14] or by formulating the problem of modulo scheduling as an integer linear programming problem and applying some solver to it [5, 15–17]. Usually, one wants to find a schedule with the minimum number of required registers among the schedules with the minimal II. Because modulo scheduling is an NP-complete problem, exact methods require extremely high computational effort; they are practically inapplicable in commercial compilers and are used mainly for research purposes. Heuristic modulo scheduling algorithms, which are considered below, make it possible to quickly find reasonably good schedules; they are applied in many modern compilers [10, 11, 13, 18]. Let us consider the general scheme of a heuristic modulo scheduling algorithm.

Finding the range of II. First, we determine the range of values of II in which a solution (i.e., a schedule of the loop kernel execution) will be sought. The lower bound on II (usually denoted MII) is normally set to max(RecMII, ResMII) (see Section 2.1 above). The upper bound on II is set to the number of cycles required for the execution of a single iteration of the original loop (without iteration overlap) minus one. Then, one searches for the minimum II in this range for which a kernel schedule can be built that executes in II cycles and has neither dependence delays nor resource conflicts. Normally, the values of II are tried successively starting from MII; according to [13], the successive search is preferable to, for example, binary search because, in most cases, a solution close to MII can be found.

Scheduling. First, the instructions are ordered in some way, and then one tries to assign to each of them the number of the clock cycle at which it will be executed. Heuristic modulo scheduling algorithms differ in the method of initial instruction ordering and in the heuristics used to search for a suitable clock cycle (position in the schedule). At each step of the algorithm, a valid partial schedule is maintained; this means that the positions assigned to instructions must neither violate the resource constraints nor cause delays due to the unavailability of arguments. The resource constraints are checked with the so-called modulo reservation table (MRT). The height of this table is equal to II, and its columns correspond to the available resources (functional units of the processor). If resource r is used at cycle t, the element MRT[r, t mod II] is marked in the table. At some point, it may happen that no suitable position can be found for the next instruction. In this case, one should either increase II or backtrack the steps executed earlier (i.e., some instructions are removed from the schedule and returned to the scheduling queue). Backtracking algorithms are called iterative (see [9, 19]). To limit the number of backtracks in iterative algorithms, one sets a budget on the number of scheduling steps; when this budget is exhausted, the search is terminated and the search for the next value of II is initiated. Even without a budget limitation, iterative algorithms give no guarantee that all possible partial schedules will be examined, because they can cycle through a subset of schedules. An increased value of II reduces the degree of parallelism but facilitates the search for a solution. Note that the scheduling heuristics must be sensitive to the value of II; i.e., one should exclude the case when there is no difference between the searches under the increased and the previous value of II. The algorithm terminates successfully if a schedule is found for some value of II within the given range.

Now, let us consider specific features of some well-known heuristic modulo scheduling algorithms.
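
The following C sketch (ours, with a deliberately simplified machine model: each instruction occupies one unit of one resource for one cycle, loop-carried dependences and backtracking are omitted) illustrates the two mechanisms just described, the modulo reservation table MRT[r, t mod II] and the outer search over II starting from MII.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_II    64
#define MAX_RES    8
#define MAX_INSNS 64

struct insn {
    int res;                    /* functional unit occupied by the instruction */
    int npred;                  /* number of intra-iteration predecessors      */
    int pred[4];                /* indices of the predecessors                 */
    int delay[4];               /* latency of each predecessor                 */
};

static bool mrt[MAX_RES][MAX_II];       /* mrt[r][t % II]   */
static int  cycle_of[MAX_INSNS];        /* resulting schedule */

/* Instructions are assumed to be indexed in a topological order of the
 * intra-iteration dependences, so predecessors are scheduled first. */
static bool try_schedule(const struct insn *ins, int n, int ii)
{
    memset(mrt, 0, sizeof mrt);
    for (int i = 0; i < n; i++) {
        /* Earliest cycle allowed by the already scheduled predecessors. */
        int earliest = 0;
        for (int k = 0; k < ins[i].npred; k++) {
            int t = cycle_of[ins[i].pred[k]] + ins[i].delay[k];
            if (t > earliest)
                earliest = t;
        }
        /* Scan II consecutive cycles; beyond that the MRT pattern repeats. */
        int placed = -1;
        for (int t = earliest; t < earliest + ii; t++) {
            if (!mrt[ins[i].res][t % ii]) {
                mrt[ins[i].res][t % ii] = true;
                placed = t;
                break;
            }
        }
        if (placed < 0)
            return false;       /* an iterative scheduler would backtrack here */
        cycle_of[i] = placed;
    }
    return true;
}

static int modulo_schedule(const struct insn *ins, int n, int mii, int max_ii)
{
    for (int ii = mii; ii <= max_ii; ii++)  /* successive search from MII */
        if (try_schedule(ins, n, ii))
            return ii;
    return -1;                               /* pipelining failed */
}
```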



Iterative Modulo Scheduling (IMS) [9] is an iterative algorithm that orders instructions by their height in the dependence graph and tries to find for each instruction the earliest position admissible from the standpoint of resource conflicts. If no suitable position can be found for some instruction, one of the instructions scheduled earlier is returned to the scheduling queue to free a place for the instruction under examination. The possibility of backtracking within a given budget is a key property of IMS; owing to it, the algorithm in some cases achieves lower values of II than non-iterative algorithms. A disadvantage of IMS is the absence of heuristics minimizing register requirements.

Slack Modulo Scheduling (Slack) [20]. In contrast to most other algorithms, which search for a free slot for an instruction bottom-up or top-down, this algorithm uses bidirectional scheduling. Another feature is the way the instructions are ordered in the scheduling queue: they are ordered by their “degree of mobility” (slack). The mobility of an instruction relative to a partial schedule is measured by the number of slots in which this instruction could be scheduled. The position for an instruction is chosen so as to minimize the total lifetimes of its input and output arguments. Like IMS, Slack is an iterative method.

Integrated Register Sensitive Iterative Software Pipelining (IRIS) [21] differs from IMS in that it applies stage-scheduling heuristics during instruction scheduling (see below): an instruction is assigned the earliest or the latest admissible position so that the register requirements are minimized.

Swing Modulo Scheduling (SMS) [10, 22] is a non-iterative and, therefore, very fast modulo scheduling algorithm. The instructions belonging to cycles of the dependence graph are ordered first, by the value of RecMII of the corresponding cycle; the criticality of the path to which an instruction belongs is used as a secondary factor. The ordering method ensures that the instructions preceding a given instruction in the scheduling queue are either only its predecessors or only its successors (but not both simultaneously). The position in the schedule for the next instruction is chosen so as to minimize the lifetimes of its input registers (if its predecessors have already been scheduled) or of its output registers (if its successors have already been scheduled). This strategy is aimed at minimizing the number of registers.

LxMS [11] combines the benefits of IMS and SMS. It is an iterative modulo scheduling algorithm in which, as in SMS, the instructions belonging to cycles of the dependence graph are given preference and the criticality of the paths containing an instruction is taken into account. In addition, this algorithm uses heuristics aimed at reducing the code size of pipelined loops and shows good results for complex architectures and/or loops with complicated recurrent dependences.
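
As a small illustration of the mobility (“slack”) measure mentioned above (a sketch of ours; in the real Slack algorithm the bounds are recomputed against the current partial schedule and the modulo constraints), the earliest and latest start cycles of each instruction can be obtained by forward and backward passes over the acyclic dependence graph:

```c
/* slack(i) = ALAP(i) - ASAP(i); instructions are assumed to be numbered in
 * topological order, and len is the length (in cycles) of a single-iteration
 * schedule. */
#define NMAX 64

struct dep { int from, to, delay; };

void compute_slack(int n, int len, const struct dep *deps, int ndeps,
                   int asap[NMAX], int alap[NMAX], int slack[NMAX])
{
    for (int i = 0; i < n; i++) {
        asap[i] = 0;                         /* earliest start cycle */
        alap[i] = len;                       /* latest start cycle   */
    }
    for (int i = 0; i < n; i++)              /* forward pass, topological order */
        for (int d = 0; d < ndeps; d++)
            if (deps[d].to == i && asap[deps[d].from] + deps[d].delay > asap[i])
                asap[i] = asap[deps[d].from] + deps[d].delay;
    for (int i = n - 1; i >= 0; i--)         /* backward pass */
        for (int d = 0; d < ndeps; d++)
            if (deps[d].from == i && alap[deps[d].to] - deps[d].delay < alap[i])
                alap[i] = alap[deps[d].to] - deps[d].delay;
    for (int i = 0; i < n; i++)
        slack[i] = alap[i] - asap[i];        /* few free slots => schedule early */
}
```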


Stage Scheduling [15, 23] is, rather, a postprocessing technique that can be applied to a schedule obtained by any modulo scheduling algorithm in order to reduce its register requirements. The position of an instruction in the schedule is moved upward or downward by a multiple of II (so that the modulo reservation table remains unchanged), and heuristics are used to minimize the maximum number of “live” variables over all cycles of the schedule.

Loops are usually pipelined before register allocation. If the number of registers required exceeds the number of physical registers, one has to temporarily free some registers by inserting additional code that stores them in memory and later restores them (spill code), which degrades performance. To control this effect, one can use the following approaches: (1) during modulo scheduling, the schedules requiring more than a given number of registers are rejected; (2) modulo scheduling is combined with register allocation, and the spill code (if needed) is scheduled together with the instructions of the original loop. These approaches ensure that register allocation does not reduce the throughput because of additional spill code.

There are also modifications of modulo scheduling algorithms designed for cluster architectures [24]. In [25], the authors consider a problem that is of interest for architectures supporting operations on short vectors: the optimal joint use of data parallelism and instruction-level parallelism in pipelined loops. The processor is regarded as a cluster consisting of a node of vector operations and a node of scalar operations. Before the loop is pipelined, the set of operations is distributed between the vector and the scalar nodes according to a cost function that takes into account the evenness of the workload of the functional units of both nodes and the overhead of data transfers between the nodes.

2.3. Comparative Analysis of Modulo Scheduling Algorithms

The fact that many different heuristics are used in modulo scheduling methods indicates that there is no universal approach suitable for all criteria and microprocessor architectures. The results of a comparative analysis of the IMS, Slack, IRIS, and SMS methods (with postprocessing by stage scheduling) can be found in [12]. The analysis is based on the Perfect Club and SPECfp95 tests for architectures of three types: simple, medium, and complex. The complexity of an architecture is determined by such characteristics as the presence of complex resource constraints, the availability of non-pipelined or incompletely pipelined operations, and a low degree of available parallelism.



The methods were assessed, in particular, by the efficiency of the generated loops, measured by the ratio ΣII/ΣMII, where the sums are taken over all test loops, by the register usage, and by the running time of the scheduling algorithm itself. In all cases, almost all methods (with the exception of Slack) successfully pipelined the loops (i.e., they found schedules more efficient than the initial loop). Stage scheduling is effective for all the methods but requires substantial computational effort. In terms of running time, SMS proved to be the best method in all cases, as expected, owing to the fact that it is non-iterative. SMS also proved to be the best in terms of register saving: even without stage scheduling, the register usage of SMS (and Slack) turns out to be lower than that of IMS or IRIS followed by stage scheduling. For simple architectures, the quality of the generated code proved to be similar for all the methods. For almost all loops, the resulting schedules are optimal (II = MII); the ratio ΣII/ΣMII ranges from 1.0002 (SMS) to 1.0035 (IRIS). However, as the architectural complexity of the target processor grows, the iterative methods prove to be advantageous in terms of this ratio, and differences in the other factors also start to emerge. Another comparison of modulo scheduling methods on other test sets (see [11]) concludes that SMS fails to cope with some loops having complex recurrent dependences.


2.4. Code Generation for Pipelined Loops

Software pipelining can lead to a considerable increase in code size. This is explained, first of all, by the need to generate prologs and epilogs (Fig. 1) to ensure that the pipelined loop is equivalent to the initial one and, second, by the use of so-called modulo variable expansion (MVE) for variables whose lifetime becomes longer than II as a result of pipelining. If the lifetime of a variable in the pipelined loop exceeds II, the variable must be assigned more than one register so that different registers are used in different iterations. The variable can be duplicated by inserting additional move instructions into the kernel code that pass the “cloned” variable from one register to another. This approach is suitable for architectures with a high degree of parallelism, where such moves in most cases find a free slot in the schedule. For processors with a low degree of parallelism, the insertion of moves leads to an increased II; therefore, in these cases, it is preferable to use MVE. This technique implies that the kernel of the pipelined loop is unrolled with a “cloning” of registers so that the same variable resides in different registers in different iterations. The unroll (expansion) coefficient can be calculated as LCM(K1, …, Km), where K1, …, Km are the numbers of copies required for the variables of the loop.
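
A small sketch (ours; the lifetimes in the example call are hypothetical) of how the MVE coefficients can be derived: the number of copies of a variable is the number of II-long kernel slots its lifetime spans, and the kernel unroll factor is either the least common multiple or the maximum of these values, as discussed here and below.

```c
#include <stdio.h>

/* K_i = ceil(lifetime_i / II) is the number of register copies needed for
 * variable i; the kernel is unrolled by lcm(K_1..K_m) (minimal registers)
 * or by max(K_1..K_m) (smaller code, some copies left unused). */

static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }
static int lcm(int a, int b) { return a / gcd(a, b) * b; }

static void mve_factors(const int lifetime[], int m, int ii,
                        int *unroll_lcm, int *unroll_max)
{
    *unroll_lcm = 1;
    *unroll_max = 1;
    for (int i = 0; i < m; i++) {
        int k = (lifetime[i] + ii - 1) / ii;     /* K_i = ceil(lifetime_i / II) */
        if (k < 1) k = 1;
        *unroll_lcm = lcm(*unroll_lcm, k);
        if (k > *unroll_max) *unroll_max = k;
    }
}

int main(void)
{
    /* Hypothetical lifetimes (in cycles) of three kernel variables, II = 2. */
    int lifetime[] = { 3, 4, 6 };
    int u_lcm, u_max;
    mve_factors(lifetime, 3, 2, &u_lcm, &u_max);
    printf("unroll by lcm = %d, by max = %d\n", u_lcm, u_max);  /* 6 and 3 */
    return 0;
}
```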

With this choice, the number of registers used for the copies of the variables is minimized. In practice, however, some registers are usually left unused and the code size is reduced instead by taking the expansion coefficient equal to max(K1, …, Km) (see, for example, [11]). Note that kernel unrolling entails a corresponding multiplication of the epilogs.

Many processor architectures have hardware support for pipelining, including predicated execution and rotation of the register file. Predicated execution makes it possible to hide the prolog and epilog inside the loop kernel by suppressing the execution of individual instructions of the initial and final iterations. The mechanism of register file rotation provides the expansion of variables whose lifetimes exceed II. However, as indicated in [11], the microprocessors used in embedded systems normally have no such hardware support, because its implementation makes the architecture highly complicated and requires an increase in chip size; it also complicates the structure of the instruction word and, accordingly, the instruction decoding stage of the pipeline. At the same time, code size is usually a critical factor for such systems. The LxMS pipelining method described in the above-mentioned study uses a number of mechanisms allowing the code size to be significantly reduced. First, the modulo scheduling includes heuristics aimed at decreasing the maximum lifetime of variables and at reducing the number of stages (the number of overlapped iterations in the kernel). Both factors affect the code size: the expansion coefficient of MVE depends on the lifetimes of variables, and the number of stages determines the size of the prolog and epilog. Second, code-generation schemes aimed at decreasing the size of the epilogs are used. In particular, two such schemes are proposed: adding a preliminary loop and speculative modulo scheduling. The first technique implies that, before the pipelined loop, the initial loop is executed for such a number of iterations that the number of iterations of the pipelined loop becomes a multiple of the MVE expansion coefficient, which makes it possible to generate only a single epilog. It is reasonable to use this method only when the code saving achieved by reducing the number of epilogs exceeds the size of the initial loop. A disadvantage of this method is that the initial iterations are executed inefficiently. The second technique implies that some instructions of additional iterations that should not be executed at all are actually executed. This makes it possible to completely or partially hide the epilogs in the kernel code. Indeed, in the example of the pipelined loop (see Fig. 1c), one can delete the epilog and increase the number of iterations by 3. The epilog instructions will then be executed as additional iterations; however, some “extra” instructions will also be executed. Therefore, this approach requires additional care during code generation. For example, the use of speculative computations in the example of Fig. 1c means that some values outside the bounds of the arrays x and y will be loaded. Consequently, it is necessary to allocate additional memory for these arrays and take care of initializing the extra elements appropriately.



Along with the code-size reduction, speculative modulo scheduling has another advantage: it makes it possible to pipeline loops with an a priori unknown number of iterations.

3. GLOBAL MODULO SCHEDULING

The modulo scheduling algorithms discussed above can be applied only to simple loops whose bodies consist of a single basic block. For loops with several basic blocks, there are methods of global modulo scheduling. The essence of all these methods is that a complex loop is first transformed in some way into straight-line code, which is then subjected to modulo scheduling. In this case, code generation is somewhat complicated, since the resulting loop must reproduce the control flow of the initial loop. In addition to the global modulo scheduling methods considered below, there are global scheduling methods based on other pipelining mechanisms, such as unrolling and packing (perfect pipelining) [7, 26–28].

Hierarchical reduction [13] is a method that transforms loops containing conditional constructs into straight-line code that can be pipelined by any of the above-mentioned modulo scheduling methods. Here, a conditional construct is collapsed into a single node of the dependence graph whose resource requirements are defined as max(R1, R2), where R1 and R2 are the resource requirements of the two branches of the conditional construct. Thus, the resource requirements of this node correspond to the maximum of the resource requirements over both branches of the conditional construct. This method is an improvement of an earlier approach [29], in which the nodes corresponding to conditional constructs were regarded as “black boxes” consuming all the resources of the pipeline. The reduction can also be applied to inner loops, which makes it possible to pipeline loop nests. The reduction is applied first to the lowest-level constructs and then hierarchically to the enclosing constructs. The main advantage of hierarchical reduction is that pipelining can be applied to a wider class of loops when no hardware support (predicated execution) is available. The reduction also “compacts” the code in long bodies of loops with conditional constructs: the code outside the conditional constructs and independent of them can be moved around these constructs or executed simultaneously with them.

The method of if-conversion associates each control dependence with a logical variable (flag), which makes it possible to transform control dependences into data dependences [30]. Conditional branches are replaced by flag-setting instructions, and the instructions that depended on the conditional branches become predicated instructions, i.e., they are executed only if the corresponding flag is set. Normally, this method is used when the hardware supports predicated execution (see, for example, [31]); however, there is an improved approach [32] that can be used when the hardware does not support it.
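
As a schematic illustration of if-conversion (ours; shown at the source level, whereas the transformation is actually applied to the intermediate or machine code), a conditional update inside a loop body can be rewritten so that both branches become unconditionally executed, flag-guarded operations:

```c
/* Before if-conversion: the loop body contains a conditional branch. */
void scale_or_copy(int n, const int *a, int *b)
{
    for (int i = 0; i < n; i++) {
        if (a[i] > 0)
            b[i] = 2 * a[i];
        else
            b[i] = a[i];
    }
}

/* After if-conversion: the branch is replaced by a flag computation, both
 * assignments are computed unconditionally, and the result is selected under
 * the flag; the control dependence has become a data dependence on p, and the
 * loop body is now a single basic block suitable for modulo scheduling. */
void scale_or_copy_ifconv(int n, const int *a, int *b)
{
    for (int i = 0; i < n; i++) {
        int p  = (a[i] > 0);        /* flag-setting instruction             */
        int t1 = 2 * a[i];          /* "then" value, computed speculatively */
        int t2 = a[i];              /* "else" value                         */
        b[i] = p ? t1 : t2;         /* select guarded by the flag p         */
    }
}
```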


Enhanced Modulo Scheduling [32] makes it possible to combine the benefits of the hierarchical reduction and if-conversion methods by using if-conversion (without hardware support) to transform conditional constructs into straight-line code and by generating conditional branches again after modulo scheduling (as in hierarchical reduction). The scheduling uses an extended resource reservation table with elements of three types: free, occupied, and partially occupied; the latter means that additional instructions from other execution paths can be scheduled at this position. A limitation of this method is the assumption that there are no memory access dependences between iterations, which significantly restricts its applicability.

Modulo Scheduling for Superblocks [18] is an extension of SMS [22] to loops whose body is a superblock (a set of basic blocks with a control flow having a single entry and several exits). In this case, if an execution profile is available, the maximum efficiency of code execution along the most frequent paths of the control-flow graph is achieved.

4. CONCLUSIONS

Loop pipelining makes it possible to overlap the execution of instructions belonging to different loop iterations, which leads to a considerable improvement in code performance, since loops consume the greater part of the program execution time. Modern commercial compilers mainly use heuristic algorithms based on modulo scheduling methods for loop pipelining. These methods can be applied to a fairly wide class of loops, including those consisting of several basic blocks, as well as loops in which the number of iterations is not known in advance. Experimental data for architectures with a high degree of instruction-level parallelism, a sufficient number of registers, and no complex resource dependences indicate that heuristic modulo scheduling algorithms produce almost optimal code.

An important area of research remains the analysis and design of loop pipelining algorithms for processors with limited hardware resources, taking into account such factors as a low degree of parallelism, a small number of registers (or the presence of groups of heterogeneous registers), and complex resource dependences in the pipeline. According to the tests for processors with limited hardware resources presented in [11, 12], the code generated by some modulo scheduling algorithms used in commercial compilers is often far from optimal.



The problem of pipelining is also complicated by complex recurrent dependences in loops. When the efficiency of the application program is of prime importance and the compilation time plays no substantial role, it seems appropriate to use more laborious exact methods of finding the optimal solution, such as search algorithms or integer linear programming.

REFERENCES

1. Shumakov, S.M., Review of Code-Optimization Methods for Processors with Instruction-Level Parallelism, Moscow: Institut sistemnykh issledovanii RAN, 2002.
2. Fisher, J.A., Global Code Generation for Instruction-Level Parallelism: Trace Scheduling-2, Technical Report HPL-93-43, Hewlett-Packard Laboratories, June 1993.
3. Grossman, J.P., Compiler and Architectural Techniques for Improving Effectiveness of VLIW Compilation, ICCD, 2000.
4. Hwu, W.W., Mahlke, S.A., Chen, W.Y., Chang, P.P., Warter, N.J., Bringmann, R.A., Ouellette, R.G., Hank, R.E., Kiyohara, T., Haab, G.E., Holm, J.G., and Lavery, D.M., The Superblock: An Effective Technique for VLIW and Superscalar Compilation, J. Supercomputing, 1993, vol. 7, pp. 229–249.
5. Eichenberger, A., Modulo Scheduling, Machine Representations, and Register-Sensitive Algorithms, Ph.D. Thesis, Univ. of Michigan, 1997.
6. Chao, L.-F., LaPaugh, A.S., and Sha, E.H., Rotation Scheduling: A Loop Pipelining Algorithm, IEEE, 1997.
7. Aiken, A. and Nicolau, A., Resource-Constrained Software Pipelining, IEEE Trans. Parallel Distributed Syst., 1995, vol. 6, no. 12, pp. 1248–1269.
8. Frantsuzov, Yu.A., Review of Code-Parallelization and Loop Pipelining Methods, Programmirovanie, 1992, no. 3, pp. 16–37.
9. Rau, B.R., Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops, Proc. of the 27th Int. Symp. on Microarchitecture, November 1994, pp. 63–74.
10. Hagog, M., Swing Modulo Scheduling for GCC, Proc. of the GCC Developers' Summit, Ottawa, Canada, 2004, pp. 55–65.
11. Llosa, J. and Freudenberger, S.M., Reduced Code Size Modulo Scheduling in the Absence of Hardware Support, Proc. of the 35th Annual Int. Symp. on Microarchitecture (MICRO-35), November 2002.
12. Codina, J.M., Llosa, J., and Gonzalez, A., A Comparative Study of Modulo Scheduling Techniques, Proc. of the 16th Int. Conf. on Supercomputing (ICS'02), New York: ACM Press, 2002, pp. 97–106.
13. Lam, M.S., Software Pipelining: An Effective Scheduling Technique for VLIW Machines, Proc. ACM SIGPLAN Conf. on Programming Language Design and Implementation, June 1988, pp. 318–328.
14. Altman, E.R. and Gao, G.R., Optimal Modulo Scheduling Through Enumeration, Int. J. Parallel Programming, 1986, vol. 26, no. 3, pp. 313–344.
15. Eichenberger, A., Davidson, E., and Abraham, S., Optimum Modulo Schedules for Minimum Register Requirements, Proc. Int. Conf. on Supercomputing, July 1995, pp. 31–40.
16. Ning, Q. and Gao, G.R., A Novel Framework of Register Allocation for Software Pipelining, Proc. of the Twentieth Annual ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages, 1993, pp. 29–42.
17. Govindarajan, R., Altman, E., and Gao, G., Minimizing Register Requirements under Resource-Constrained Rate-Optimal Software Pipelining, Proc. of the 27th Annual Int. Symp. on Microarchitecture, November 1994, pp. 85–94.
18. Lattner, T.M., An Implementation of Swing Modulo Scheduling with Extensions for Superblocks, Thesis, Univ. of Illinois at Urbana-Champaign, 2005.
19. Rubanov, V.V., Grinevich, A.I., and Markovtsev, D.A., Specific Optimization Features in a C Compiler for DSPs, Programmirovanie, 2006, no. 1, pp. 26–40.
20. Huff, R.A., Lifetime-Sensitive Modulo Scheduling, Proc. of the ACM SIGPLAN'93 Conf. on Programming Language Design and Implementation, Albuquerque, New Mexico, 1993, pp. 258–267.
21. Dani, A.K., Ramanan, V.J., and Govindarajan, R., Register-Sensitive Software Pipelining, Proc. of the Merged 12th Int. Parallel Processing Symp. and 9th Int. Symp. on Parallel and Distributed Systems, 1998.
22. Llosa, J., Gonzalez, A., Ayguade, E., and Valero, M., Swing Modulo Scheduling: A Lifetime-Sensitive Approach, Proc. of the Int. Conf. on Parallel Architectures and Compilation Techniques, 1996.
23. Eichenberger, A. and Davidson, E., Stage Scheduling: A Technique to Reduce the Register Requirements of a Modulo Schedule, Proc. of the 28th Annual IEEE/ACM Int. Symp. on Microarchitecture (MICRO-28), 1995.
24. Codina, J.M., Sanchez, J., and Gonzalez, A., A Unified Modulo Scheduling and Register Allocation Technique for Cluster Processors, Proc. of the Int. Conf. on Parallel Architectures and Compilation Techniques, September 2001, pp. 175–184.
25. Larsen, S., Rabbah, R., and Amarasinghe, S., Exploiting Vector Parallelism in Software Pipelined Loops, Technical Report, Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, June 2005.
26. Ebcioglu, K. and Nakatani, T., A New Compilation Technique for Parallelizing Loops with Unpredictable Branches on a VLIW Architecture, Selected Papers of the Second Workshop on Languages and Compilers for Parallel Computing, London: Pitman, 1990, pp. 213–229.
27. Su, B. and Gurpr, J., A New Global Software Pipelining Algorithm, Proc. of the 24th Annual Int. Symp. on Microarchitecture (MICRO-24), New York: ACM Press, 1991, pp. 212–216.
28. Evstigneev, V.A., Some Features of Software for Computers with a Large Instruction Word (Review), Programmirovanie, 1991, no. 2, pp. 69–80.
29. Wood, G., Global Optimization of Microprograms Through Modular Control Constructs, Proc. of the 12th Annual Workshop on Microprogramming (MICRO-12), Piscataway, NJ: IEEE Press, 1979, pp. 1–6.
30. Allen, J.R., Kennedy, K., Porterfield, C., and Warren, J., Conversion of Control Dependence to Data Dependence, Proc. of the 10th ACM SIGACT-SIGPLAN Symp. on Principles of Programming Languages (POPL'83), New York: ACM Press, 1983, pp. 177–189.
31. Smelyanskiy, M., Mahlke, S., and Davidson, E.S., Probabilistic Predicate-Aware Modulo Scheduling, Proc. of the Int. Symp. on Code Generation and Optimization (CGO 2004), 2004.
32. Warter, N.J., Haab, G.E., Subramanian, K., and Bockhaus, J.W., Enhanced Modulo Scheduling for Loops with Conditional Branches, Proc. of the 25th Annual Int. Symp. on Microarchitecture (MICRO-25), Los Alamitos, CA: IEEE Computer Society Press, 1992, pp. 170–179.
