Theory of Modulo-Scheduled Pipelines 

R. Govindarajan

Supercomputer Education and Research Centre Dept. of Computer Science and Automation Indian Institute of Science Bangalore, 560 012, India [email protected]

Erik R. Altman

IBM T.J. Watson Research Center Yorktown Heights, NY 10598, U.S.A. [email protected]

Guang R. Gao

Dept. of Electrical Engineering & Computer Information Sciences University of Delaware Newark, DE 19716, U.S.A. [email protected]

This work was supported by research grants from NSERC, Micronet - Network Centers of Excellence, Canada, and the Memorial University of Newfoundland, President's Research Grant.


Abstract

Extensive research has been reported on the scheduling of hardware pipelines and on software pipelining, but largely independently of each other. One interesting problem is how to extend hardware pipeline theory to handle software pipelined loops in user programs. More specifically, the problem is how to maximize the utilization of hardware resources under modulo scheduling for a given II. In this paper, we develop a finite state automaton (FSA) based framework for analyzing and improving the utilization of Modulo-Scheduled (MS) pipelines, i.e. hardware pipeline structures operating under software pipelining. The main contributions of this paper are:

1. We establish, under certain conditions, a necessary and sufficient condition for achieving the upper bound (UB Init) on the number of initiations in the given pipeline (with a fixed resource reservation table) (Theorem 4.1). The condition is quite powerful, yet surprisingly simple to check.
2. We demonstrate that the pipeline reconfiguration method (e.g. changing the reservation table by introducing delays in the pipeline) from classical pipeline theory can be adapted to improve the utilization of the hardware pipeline, and hence obtain better modulo schedules.
3. We establish that such a pipeline reconfiguration method can always achieve UB Init, and hence the maximum utilization of hardware pipelines (Theorem 5.1). A procedure to accomplish delay insertion has been developed and implemented.
4. Our initial experiments show that with the insertion of a small delay, often 1 cycle, up to 90% of UB Init (and maximum utilization) of the hardware pipeline can be achieved for a wide range of IIs. This is further evidence of the usefulness of the proposed MS-pipeline theory and the delay insertion method.

Keywords:

Pipeline Architecture, Software Pipelining, Classical Pipeline Theory, Modulo-Scheduled Pipelines, VLIW/Superscalar Architectures.


1 Introduction

High-performance compilers perform aggressive instruction scheduling to fully exploit the multiple-issue, multiple-execution capabilities of modern superscalar and VLIW processors. A legal instruction schedule must respect all structural hazards, e.g. contention for hardware resources by instructions. Instruction schedulers [5, 20, 6, 7, 13, 15, 11, 10, 3, 19, 21] in modern compilers check for and avoid structural hazards. One approach followed by instruction schedulers is to model the architecture's resources explicitly by simulating instruction execution. At each cycle, the scheduler maintains a list of resources committed in the current and future cycles, usually via a resource reservation table. When a new instruction is to be scheduled, its resource usage is checked against the reservation table for any possible structural hazard with the instructions scheduled earlier. The instruction is scheduled in the current cycle only if there is no structural hazard. This method has been used in several production compilers [3, 5, 22]. One drawback of this method is its efficiency: the resource reservation table has a size of O(m × n), where m is the number of resources and n is the (longest) execution time of the instructions.

To overcome this drawback, an FSA (finite state automaton) based instruction scheduling technique, originating from classical hardware pipeline theory [17, 12], has been proposed. In this method, the processor resource usage is modeled by constructing an FSA from the resource usage table of each instruction (or instruction class). This effectively reduces the problem of checking structural hazards to a fast table lookup, thereby giving a good speedup in scheduling time [16, 18]. More recently, the FSA method has been applied to the KSR (Kendall Square Research) compiler, and further extended to binary code translation for the DEC ALPHA 21064 [1].
One major challenge is how to extend hardware pipeline theory to handle software pipelining or modulo scheduling [20, 13, 10, 3, 19, 21]. In Section 2.2 we provide a brief introduction to software pipelining. A comprehensive survey of several software pipelining methods can be found in [19]. Under modulo scheduling, successive instances of one operation are initiated at time steps separated by a constant interval, known as the initiation interval II. The FSA-based approaches proposed in [16, 18, 1] do not take into account the repeated initiations of an instruction once every II cycles. As a consequence, these methods cannot be applied directly to modulo-scheduled loops.

The Co-Scheduling framework proposed independently in [9] develops an FSA-based method for software pipelining. Based on classical pipeline theory [17], the Co-Scheduling method constructs a finite state automaton to determine latency sequences that result in the maximum number of initiations (Max Init) in a hardware pipeline operating under modulo scheduling. In earlier work [16, 18, 1, 9], the emphasis is on how to perform instruction scheduling efficiently for the given target architecture. In other words, these methods did not consider how to improve upon Max Init by modifying the resource usage pattern of the pipeline via delay insertion, a method discussed in [17].

This paper investigates a method to derive resource usage patterns that improve the number of initiations beyond Max Init for a pipeline operating under modulo scheduling, henceforth called a modulo-scheduled pipeline (MS-pipeline), together with the underlying theory, called MS-pipeline theory. The main contributions of this paper are:

1. We establish, under certain conditions, a necessary and sufficient condition for achieving the upper bound on the number of initiations (henceforth referred to as UB Init) during II cycles for a fixed resource reservation table. The condition is powerful, yet surprisingly simple to check.
2. We demonstrate that the delay insertion method of classical pipeline theory can be adapted to modify the MS-pipeline configuration so as to achieve an increased number of initiations in a modulo schedule.
3. We establish that, by using the delay insertion method, Max Init can always be improved to reach UB Init, the theoretical upper bound on the number of initiations in II cycles. This makes it possible to achieve higher utilization of the hardware pipeline and hence obtain better modulo schedules.
4. We study the effectiveness of the proposed MS-pipeline theory and the delay insertion method by conducting experiments on several reservation tables. Our initial experiments show that with the insertion of a small delay, often 1 cycle, up to 90% of UB Init (and maximum utilization) of the hardware pipeline can be achieved for a wide range of IIs.

The proposed theory of MS-pipelines and the methods to improve their performance are of interest to hardware pipeline designers and compiler writers (for software pipelining). On the hardware side, the theory provides useful hints to computer architects in deciding how to modify the resource usage, and hence which versions of an instruction to support, in order to get higher throughput (under modulo scheduling) from their pipelines. A simple but effective way to accomplish the effect of delay insertion is to support multiple versions of a single instruction, i.e. versions with similar but distinct resource usage patterns on the hardware pipeline. The compiler can be allowed to choose the most appropriate version of an instruction so as to maximize the throughput for a given II. In Section 6, we present a sketch of a software pipelining method that follows this approach and attempts to choose the most appropriate version for each instruction class for a given loop. Further, it was demonstrated in [9] that the basic MS-pipeline theory (without delay insertion) helps to obtain efficient software pipelined schedules. Lastly, the proposed MS-pipeline theory is also directly useful in high-level system synthesis, where hardware structures are designed to achieve maximum utilization of resources for a specific application.

The rest of the paper is organized as follows. In the following section we motivate the need for the theory of MS-pipelines through a number of examples. The subsequent section develops the basic framework of MS-pipelines. In Section 4 we establish the conditions for achieving a given number of initiations in an MS-pipeline. Section 5 describes a method to introduce delays in the pipeline to achieve a specific initiation sequence. A brief sketch of a software pipelining method that uses the MS-pipeline theory developed in this paper is presented in Section 6. Experimental results on the usefulness of the MS-pipeline theory are reported in Section 7. Related work and concluding remarks are presented in Sections 8 and 9.

2 MS-Pipelines: Motivation and Problem Statement

In this section, we motivate the need for modulo-scheduled pipelines (MS-pipelines) and the underlying theory for scheduling them. In Section 2.1, we briefly introduce the terminology used in classical pipeline theory. Next, we demonstrate the need for MS-pipelines in Section 2.2 with the help of several motivating examples. Finally, in Section 2.3, we present the problem statement for MS-pipeline theory.

2.1 Terminology

First let us define a number of terms used in classical pipeline theory [12]. The resource usage of the various stages of a hardware pipeline is represented by a two-dimensional Reservation Table. An X mark in the i-th row and j-th column (henceforth referred to as (i, j)) indicates that the i-th stage is required j time steps after the operation was initiated. The time between the initiations of two operations is termed the latency. A latency is said to cause a collision if the two operations require the same stage of the pipeline at the same time. Multiple operations can be processed simultaneously in the pipeline as long as there is no collision. If two operations entering a pipeline l cycles apart do not cause a collision, then l is termed a permissible latency; otherwise it is forbidden. For the reservation table shown in Figure 1(a), latencies 2 and 3 are forbidden; all other latencies, i.e. 1, 4, 5, ..., are permissible (see footnote 1).

(a) Reservation table

Stage   Time Steps
        0   1   2   3   4
1       x   .   .   x   .
2       .   x   .   x   .
3       .   .   x   .   .
4       .   .   .   .   x

(b) State diagram: states 10110 (initial) and 11110; an arc labeled 1 leads from 10110 to 11110, an arc labeled ≥4 leads from 11110 back to 10110, and an arc labeled ≥4 leads from 10110 to itself.

Figure 1: An Example Reservation Table and its State Diagram

Classical pipeline theory identifies initiation sequences or latency sequences which maximize the throughput and the utilization of the pipeline using state diagrams [12]. Each state in the state diagram is represented by a collision vector which specifies the forbidden latencies (indicated by 1) and permissible latencies (indicated by 0) in the current state. Arcs in the state diagram indicate the state transition caused by a new initiation from the current state at the specified (permissible) latency. Refer to [12] for the construction of the state diagram. From the state diagram shown in Figure 1(b), we can identify a latency cycle {1, 4} that yields 2 initiations in 5 cycles. This latency cycle {1, 4} yields the maximum throughput of 2/5 initiations per cycle, with a period of 5, for the given reservation table. It is important to notice here that a latency cycle and its period are solely determined by the resource usage of the operation, i.e. the reservation table associated with the pipeline. The term cyclic scheduling of pipelines has been used when pipelines are scheduled under such an identified latency cycle. Identifying latency cycles that maximize both the throughput and the utilization of the stages was effectively employed in pipeline and vector architectures.

Footnote 1: Latency 0 is forbidden for single-function pipelines, i.e. pipelines which support exactly one type of operation.
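The forbidden latencies and the initial collision vector of Section 2.1 can be recomputed with a few lines of Python. This is only a sketch: the dictionary layout and the function names are illustrative, and the X positions in `RESERVATION_TABLE` are an assumption chosen to reproduce the forbidden latencies 2 and 3 quoted in the text.

```python
# Reservation table as {stage: set of time steps carrying an X mark}.
# ASSUMPTION: illustrative X positions, chosen so that latencies 2 and 3
# come out forbidden as stated in the text.
RESERVATION_TABLE = {1: {0, 3}, 2: {1, 3}, 3: {2}, 4: {4}}


def forbidden_latencies(table):
    """Latencies at which two initiations need the same stage in the same cycle."""
    forbidden = set()
    for steps in table.values():
        for a in steps:
            for b in steps:
                if b > a:
                    forbidden.add(b - a)
    return forbidden


def collision_vector(table):
    """Bit i is '1' when latency i is forbidden; latency 0 is always forbidden."""
    length = 1 + max(max(steps) for steps in table.values())
    forbidden = forbidden_latencies(table) | {0}
    return "".join("1" if i in forbidden else "0" for i in range(length))


print(sorted(forbidden_latencies(RESERVATION_TABLE)))  # [2, 3]
print(collision_vector(RESERVATION_TABLE))             # 10110
```

Any latency beyond the length of the table is permissible in the classical setting, which is why the vector only needs one bit per column.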

2.2 Modulo-Scheduled Pipelines: Motivation

Modulo scheduling or software pipelining [20, 13, 10, 19, 3, 21] is a compiling technique to extract higher instruction-level parallelism in loops. Under this framework, scheduling the instructions of the loop results in the repetitive initiation of operations in a hardware pipeline. As noted in the Introduction, the initiation interval II between two such initiations of an operation in the software pipelined schedule is determined by both the loop-carried dependences (or recurrences) in the loop and the resource availability and usage [20, 13, 10, 19]. In a linear, periodic software pipelined schedule with an initiation interval II, an instruction $i$ in iteration $j$ is initiated at time $j \cdot II + t_i$.

Suppose two floating point multiply instructions $i_1$ and $i_2$ of a loop are to be scheduled with II = 6 in a single pipeline whose reservation table is shown in Figure 1(a). Classical pipeline theory analysis revealed that {1, 4} is a permissible latency cycle, accommodating 2 initiations in 5 cycles. However, we may or may not be able to "fit" the latency cycle to the current II = 6. In this case, the latency cycle {1, 4} does not match the given II. More specifically, latency 4, which was originally permissible for the given pipeline, causes a collision when initiations are repeated with an II of 6. To illustrate this, assume instructions $i_1$ and $i_2$ are scheduled at time steps 0 and 4, with a latency of 4 between them. The resource usage of these initiations is represented in a Modulo Reservation Table (MRT) in Figure 2(a). In the MRT, resource usage at time step 6 or later wraps around, i.e. a usage in time step t is shown in column t mod 6. We represent the resource usage of the first initiation by 0 and that of the second, at time 4, by 4. Additionally, we use the symbol 'b' to indicate the time step at which an operation was initiated. We observe that both initiations require stage 2 at time step 1. Thus, even though 4 is a permissible latency for the given pipeline (according to classical pipeline theory), it causes a collision under modulo scheduling with II = 6. The collision occurs in the software pipelined cycle due to the wrap-around resource usage. Classical pipeline theory does not address such collisions, which are caused mainly by the wrap-around resource usage.

The latency cycle {1, 5}, with a period of 6, "matches" the given initiation interval II = 6. Two initiations started at time steps, say, 0 and 5 do not cause any collision, as shown in Figure 2(b). Under this initiation, stages 1 and 2 of the pipeline are used in 4 out of the 6 cycles, i.e. have a utilization of only 66.67%. Suppose there are three instructions $i_1$, $i_2$, and $i_3$ in a loop and II = 6: can all three instructions be initiated in the same pipeline? Since each stage of the pipeline is used in at most 2 cycles, it may be possible to initiate 3 operations in 6 cycles. However, none of the latency cycles of this pipeline allows three initiations in 6 cycles. Thus, for the given reservation table, one has to resort to a higher II in order to accommodate the 3 initiations in a single pipeline. For the given reservation table, 3 initiations are possible in the pipeline only for values of II ≥ 9. This 50% increase in II (from 6 to 9) reduces the computation rate of the software pipelined loop by one third. This gives rise to the questions which motivate our work on MS-pipeline theory.

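(a) Initiations at time steps 0 and 4 (latency 4)

Stage   Time Steps
        0     1     2     3     4     5
1       0b    4     .     0     4b    .
2       .     0,4   .     0     .     4
3       4     .     0     .     .     .
4       .     .     4     .     0     .

(b) Initiations at time steps 0 and 5 (latency 5)

Stage   Time Steps
        0     1     2     3     4     5
1       0b    .     5     0     .     5b
2       5     0     5     0     .     .
3       .     5     0     .     .     .
4       .     .     .     5     0     .

Figure 2: MRTs Corresponding to Initiations with Different Latencies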
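The wrap-around collision described in Section 2.2 can be reproduced by filling a modulo reservation table programmatically. The sketch below is illustrative only: `build_mrt` and `collisions` are hypothetical helper names, and the X positions in `RESERVATION_TABLE` are an assumption consistent with the forbidden latencies 2 and 3 given for Figure 1(a).

```python
# ASSUMPTION: illustrative X positions consistent with forbidden latencies 2 and 3.
RESERVATION_TABLE = {1: {0, 3}, 2: {1, 3}, 3: {2}, 4: {4}}


def build_mrt(table, offsets, ii):
    """Map each (stage, column) slot to the initiation offsets that use it."""
    mrt = {}
    for start in offsets:
        for stage, steps in table.items():
            for t in steps:
                # Usage at absolute time start + t lands in column (start + t) mod II.
                mrt.setdefault((stage, (start + t) % ii), []).append(start)
    return mrt


def collisions(table, offsets, ii):
    """Slots of the MRT that are claimed by more than one initiation."""
    return {slot: users
            for slot, users in build_mrt(table, offsets, ii).items()
            if len(users) > 1}


print(collisions(RESERVATION_TABLE, [0, 4], 6))  # {(2, 1): [0, 4]}
print(collisions(RESERVATION_TABLE, [0, 5], 6))  # {}
```

With latency 4 the only conflict is stage 2 in column 1, which matches the collision discussed in the text; with latency 5 the table fills without conflict.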

2.3 Problem Statement

The previous subsection shows that latency cycles obtained from classical pipeline theory may or may not be permissible under modulo scheduling. The problem is more involved as the initiation interval of different loops could be different. Secondly, how does one identify latency cycles that yield higher throughput and are permissible for a given II under modulo scheduling? Lastly, is it at all possible to "adjust" the reservation table of the pipeline to accommodate 3 initiations? Patel and Davidson [17] have addressed a similar issue in the case of hardware pipelines. By inserting delays (non-compute stages) in the pipeline, their method reconfigures the pipeline to achieve higher throughput or utilization. Hence, the last question really is: can their method [17] be adapted to modulo-scheduled pipelines (MS-pipelines)? The first two questions have been partly addressed in our earlier work on Co-Scheduling [9]. In this paper, we develop the theory of MS-pipelines, which makes it easy to identify permissible latency sequences. Further, the MS-pipeline theory helps to address the last question, which can be formally stated as:

Given a pipeline structure (its reservation table) and an initiation interval II, is it possible to improve the utilization of the pipe by delay insertion? And how?

A more specific version of the latter problem is:

Given a pipeline structure (its reservation table), is it always possible to realize the theoretical upper bound (UB Init) on the number of initiations for a given II, using the delay insertion method?

This question is important to us for the following reason. Informally, UB Init for the given pipe when II equals 6 is $\lfloor 6/2 \rfloor = 3$, since each initiation requires stage 1 (and 2) of the pipeline for 2 time cycles. Thus, if we could realize the upper bound, i.e. 3 initiations in 6 cycles, then three instructions of a loop could be scheduled without increasing the II. This is indeed possible with the modified reservation table shown in Fig. 3(a), where a delay, represented by 'd', was introduced at time step 3 in stage 2. Fig. 3(b) shows the MRT, with 3 initiations at time steps 0, 2, and 4.

(a) Modified reservation table

Stage   Time Steps
        0   1   2   3   4   5
1       x   .   .   x   .   .
2       .   x   .   d   x   .
3       .   .   x   .   .   .
4       .   .   .   .   .   x

(b) MRT for initiations at time steps 0, 2, and 4 (II = 6)

Stage   Time Steps
        0    1    2    3    4    5
1       0b   4    2b   0    4b   2
2       2    0    4    2    0    4
3       4    .    0    .    2    .
4       .    2    .    4    .    0

Figure 3: Modified Reservation Table Supporting 3 Initiations

In the rest of this paper, we address the problems stated in this section by developing a new representation for the state diagram of MS-pipelines. The new representation is helpful in identifying and establishing the necessary and sufficient conditions for achieving a given number of initiations in MS-pipelines. We develop a method for delay insertion by extending the one proposed by Patel and Davidson [17]. Finally, we establish that it is always possible to achieve UB Init by introducing delays in the reservation table.
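The effect of the single delay in Figure 3 can be checked by brute force. The sketch below searches for the largest collision-free set of initiation offsets (the first fixed at 0) for a given II. It is not the delay insertion procedure of this paper, only an exhaustive check, and the two tables are assumptions: `ORIGINAL` is chosen to match the forbidden latencies 2 and 3, and `MODIFIED` moves the second stage-2 mark (and the stage-4 mark) one cycle later.

```python
from itertools import combinations

# ASSUMPTION: illustrative X positions for the tables of Figures 1(a) and 3(a).
ORIGINAL = {1: {0, 3}, 2: {1, 3}, 3: {2}, 4: {4}}
MODIFIED = {1: {0, 3}, 2: {1, 4}, 3: {2}, 4: {5}}  # delay inserted in stage 2


def is_collision_free(table, offsets, ii):
    """True if no (stage, column) slot of the MRT is used twice."""
    used = set()
    for start in offsets:
        for stage, steps in table.items():
            for t in steps:
                slot = (stage, (start + t) % ii)
                if slot in used:
                    return False
                used.add(slot)
    return True


def max_initiations(table, ii):
    """Exhaustive search for a largest collision-free offset set containing 0."""
    best = [0] if is_collision_free(table, [0], ii) else []
    for n in range(2, ii + 1):
        for rest in combinations(range(1, ii), n - 1):
            if is_collision_free(table, [0, *rest], ii):
                best = [0, *rest]
                break
        else:
            break  # no set of size n exists, so no larger set exists either
    return best


print(len(max_initiations(ORIGINAL, 6)))       # 2
print(len(max_initiations(MODIFIED, 6)))       # 3
print(is_collision_free(MODIFIED, [0, 2, 4], 6))  # True
```

The exhaustive search is exponential in II and is meant only to confirm small examples such as this one.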

3 Theory of Modulo-Scheduled Pipelines

In this section, we begin with a brief review of the basic concepts of MS-pipelines [9]. Then we develop a method for constructing the state diagram of MS-pipelines and establish its correctness. Throughout this paper, we consider only static [12] pipelines, whose resource usage pattern can be described by a single reservation table. Thus each (static) pipeline of an architecture (e.g. FP Add, FP Multiply, FP Divide) needs to be analyzed independently, and the results of these analyses can be used in constructing the software pipeline schedule involving these different operations.

3.1 Review of Basic Concepts

As noted earlier, the resource usage of a hardware pipeline is represented by an m × l reservation table, where m is the number of stages in the pipeline and l is the execution time (latency) of an operation executing on the FU. Let $d_{max}$ denote the maximum number of cycles for which any stage of the pipeline is needed. To represent the resource usage of MS-pipelines under the modulo scheduling framework, we extend the reservation table to a cyclic reservation table (CRT). The CRT is obtained as follows. (1) If l < II, the reservation table is extended to II columns (with the additional columns all empty). (2) If II < l, the reservation table is folded. An X mark at (s, t) in the original reservation table appears at time step t mod II in the s-th row of the folded reservation table (see footnote 2). (3) If II = l, nothing need be changed. As an example, the reservation table in Fig. 3(a) yields the cyclic reservation tables shown in Fig. 4(a) and Fig. 4(b) for II = 4 and II = 7.
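The three cases of the CRT construction collapse into a single modulo operation on the column index. The following sketch shows this; the function name is hypothetical, the X positions of the table are an assumption modeled on Fig. 3(a), and raising an error when two marks of one row fold onto the same column is a choice made here for illustration (the text instead relies on the modulo scheduling constraint, or on added delays, to rule this case out).

```python
# ASSUMPTION: illustrative X positions for the modified table of Fig. 3(a);
# the delay slot 'd' is a non-compute stage and therefore carries no X mark.
MODIFIED = {1: {0, 3}, 2: {1, 4}, 3: {2}, 4: {5}}


def cyclic_reservation_table(table, ii):
    """Fold (II < l) or extend (l < II) a reservation table to II columns."""
    crt = {}
    for stage, steps in table.items():
        folded = [t % ii for t in steps]
        if len(set(folded)) != len(folded):
            raise ValueError(
                f"stage {stage}: two X marks fold onto one column for II={ii}")
        crt[stage] = set(folded)
    return crt


print(cyclic_reservation_table(MODIFIED, 4))  # folded: marks at 4 and 5 wrap to 0 and 1
print(cyclic_reservation_table(MODIFIED, 7))  # extended: marks are unchanged
```

Because the table is stored as sets of column indices, extending it with empty columns needs no work; only the folding case changes the data.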

(a) Cyclic Reservation Table for II = 4

Stage   Time Steps
        0   1   2   3
1       x   .   .   x
2       x   x   .   .
3       .   .   x   .
4       .   x   .   .

(b) Cyclic Reservation Table for II = 7

Stage   Time Steps
        0   1   2   3   4   5   6
1       x   .   .   x   .   .   .
2       .   x   .   .   x   .   .
3       .   .   x   .   .   .   .
4       .   .   .   .   .   x   .

Figure 4: Cyclic Reservation Tables for Different IIs

Next we define forbidden and permissible latencies for MS-pipelines.

Definition 3.1 A latency f ≤ II is said to be a forbidden latency if there exists a row s in the CRT such that both (s, t) and (s, (t + f) mod II) of the CRT contain an X mark. A latency f that is not forbidden is termed permissible.

It can easily be seen that in an MS-pipeline, a latency value f greater than II is equivalent to f mod II. Further, if f is a forbidden latency, then II − f is also forbidden. The latency value 3 is forbidden in the CRT in Fig. 4(b). Hence 7 − 3 = 4 is also in the forbidden latency set, and the permissible latency set is {1, 2, 5, 6}. The following property is satisfied by the permissible latency set.

Footnote 2: With this folding, multiple X marks separated by II may be placed in the same column of the CRT. Fortunately, the modulo scheduling constraint already prohibits such occurrences. When the modulo scheduling constraint is violated, delays can be introduced to rectify the problem [9]. Thus the cyclic reservation table will not have two X marks in the same column of the CRT.

Lemma 3.1 Let $S_0 = \{p_1, \ldots, p_k\}$ be the permissible latency set for an MS-pipeline with initiation interval II. If $p_1, p_2, \ldots, p_k$ are in (strictly) ascending order, then

$$II = p_1 + p_k = p_2 + p_{k-1} = \cdots = p_{\lceil k/2 \rceil} + p_{\lceil (k+1)/2 \rceil} \qquad (1)$$

Proof: Since $p_1, p_2, \ldots, p_k$ are in ascending order, $II - p_1, II - p_2, \ldots, II - p_k$ are in descending order. Also, $p_k, p_{k-1}, \ldots, p_2, p_1$ is a descending sequence. By the definition of permissible latencies, if $p_i$ is permissible then $II - p_i$ is also permissible. Further, as $p_1$ is the smallest permissible latency, $II - p_1$ must correspond to the largest permissible latency, which must be $p_k$. That is,

$$II - p_1 = p_k \quad \text{or} \quad II = p_1 + p_k$$

Likewise, $II - p_2$ is the second largest permissible latency and hence corresponds to $p_{k-1}$. Therefore,

$$II - p_2 = p_{k-1} \quad \text{or} \quad II = p_2 + p_{k-1}$$

Likewise the other equalities, represented by $\cdots$ in Equation 1, can be established. Lastly, if k is odd, $p_{\lceil k/2 \rceil} = p_{\lceil (k+1)/2 \rceil}$ is the middle element. Otherwise $p_{\lceil k/2 \rceil}$ and $p_{\lceil (k+1)/2 \rceil}$ are the two middle elements. Hence,

$$II = p_{\lceil k/2 \rceil} + p_{\lceil (k+1)/2 \rceil}.$$

□
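Definition 3.1 and Lemma 3.1 can be checked numerically. The sketch below computes the permissible latency set of a CRT and tests the pairing property of Equation 1. The function names are illustrative, and the X positions of `CRT_II_7` are an assumption modeled on Fig. 4(b).

```python
# ASSUMPTION: illustrative X positions for the CRT of Fig. 4(b), II = 7.
CRT_II_7 = {1: {0, 3}, 2: {1, 4}, 3: {2}, 4: {5}}


def forbidden_latencies(crt, ii):
    """Latencies f with X marks at both (s, t) and (s, (t + f) mod II) in some row s."""
    forbidden = set()
    for steps in crt.values():
        for a in steps:
            for b in steps:
                if a != b:
                    forbidden.add((b - a) % ii)
    return forbidden


def permissible_latencies(crt, ii):
    """Sorted latencies in 1..II-1 that are not forbidden."""
    return sorted(set(range(1, ii)) - forbidden_latencies(crt, ii))


def lemma_3_1_holds(permissible, ii):
    """Check p_1 + p_k = p_2 + p_(k-1) = ... = II for an ascending list."""
    p = sorted(permissible)
    return all(p[i] + p[-1 - i] == ii for i in range(len(p)))


s0 = permissible_latencies(CRT_II_7, 7)
print(s0)                      # [1, 2, 5, 6]
print(lemma_3_1_holds(s0, 7))  # True
```

Both directions of each pair of marks are added to the forbidden set, which is what makes the set closed under f → II − f and hence gives the symmetry the lemma relies on.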

As an example, the initial permissible latency set of the CRT of Figure 4(b) is {1, 2, 5, 6} and II = 7. We have 1 + 6 = 2 + 5 = 7 = II. From now on, the initial permissible latency set is always assumed to be in ascending order. An upper bound on the number of initiations (UB Init) possible in an MS-pipeline was established in [9].

Theorem 3.1 [Govindarajan-Altman-Gao [9]] The upper bound on the number of operations (UB Init) that can be initiated in an MS-pipeline during II cycles is

$$\text{UB Init} = \min\left( (k + 1),\ \left\lfloor \frac{II}{d_{max}} \right\rfloor \right)$$

where k is the cardinality of the permissible latency set and $d_{max}$ is the maximum number of X marks in any row in the reservation table.

The reader is referred to [8] for a proof of this theorem. While it is easy to understand why UB Init is bounded above by $\lfloor II / d_{max} \rfloor$, the first bound, due to k, deserves some explanation. In an MS-pipeline, each initiation must start at a time step that corresponds to one of the permissible latencies. Another argument for this will be given in Lemma 4.1. What Theorem 3.1 gives is only an upper bound on the number of initiations. With the background described in this subsection, we are now ready to develop a method for the construction of the state diagram for MS-pipelines, called the MS-state diagram. The MS-state diagram is useful to identify the permissible latency sequences and hence the maximum number of initiations (referred to as Max Init) that is actually achievable in the MS-pipeline.
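The bound of Theorem 3.1 is a one-line computation once k and $d_{max}$ are known. A minimal sketch follows, with a hypothetical function name and tables whose X positions are assumptions modeled on Figures 1(a) and 4(b).

```python
def ub_init(permissible, table, ii):
    """UB Init = min(k + 1, floor(II / d_max)) as in Theorem 3.1."""
    k = len(permissible)                                   # size of the permissible set
    d_max = max(len(steps) for steps in table.values())   # most X marks in any row
    return min(k + 1, ii // d_max)


# ASSUMPTION: illustrative X positions modeled on Fig. 4(b) and Fig. 1(a).
crt_ii_7 = {1: {0, 3}, 2: {1, 4}, 3: {2}, 4: {5}}
original = {1: {0, 3}, 2: {1, 3}, 3: {2}, 4: {4}}

print(ub_init({1, 2, 5, 6}, crt_ii_7, 7))  # 3
print(ub_init({1, 5}, original, 6))        # 3
```

For the unmodified table with II = 6 the bound is 3 even though only 2 initiations are achievable, which illustrates that Theorem 3.1 gives an upper bound and not the achievable maximum.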

3.2 Modulo-Scheduled (MS) State Diagram

The MS-state diagram is somewhat similar to, but different from, the state diagram representation in classical pipeline theory [12]. The difference between the two will be discussed subsequently.

Procedure 3.1: Construction of MS-State Diagram

Step 1 The initial state $S_0$ of the MS-state diagram contains the (initial) permissible latency set $S_0 = \{p_1, p_2, \ldots, p_k\}$. We will use the state name, e.g. $S_0$, itself to represent the permissible latencies in the given state.

Step 2 For each permissible latency $p_i$ in the current state $S$, there is an arc from $S$ to a new state $S'$. $S'$ represents the state with a new initiation $p_i$ cycles after state $S$. Also, the set $S'$, computed as below, represents the set of latencies at which a further initiation can be started from state $S'$. The permissible latencies in the new state are given by

$$S' = S_{-p_i} \cap S_0$$

where $S_{-p_i}$ is defined as

$$S_{-p_i} = \{ (p_j - p_i) \bmod II \mid p_j \in S \}.$$

Some explanation of Step 2 in the construction of the MS-state diagram may be required to have a clear understanding of the state diagram. The set $S_{-p_i}$ is obtained by subtracting $p_i$, the chosen latency, from each permissible latency $p_j$ in $S$. The subtractions are performed modulo II. Intuitively, the set $S_{-p_i}$ is the set of latencies that may be permissible from the new state $S'$. However, for a latency l to be permissible in the new state $S'$, l must be in the (initial) permissible latency set $S_0$. Thus the set of permissible latencies in the new state $S'$ is the intersection of $S_0$ and $S_{-p_i}$.
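Procedure 3.1 translates almost literally into code. The sketch below builds the whole MS-state diagram as a dictionary from each state to its outgoing arcs; representing states as `frozenset` values and the names `next_state` and `build_ms_state_diagram` are choices made here for illustration.

```python
def next_state(state, p_i, initial, ii):
    """Step 2: S' = S_{-p_i} intersected with S_0."""
    shifted = {(p_j - p_i) % ii for p_j in state}  # the set S_{-p_i}
    return frozenset(shifted & initial)


def build_ms_state_diagram(initial, ii):
    """Return {state: {latency: successor state}} reachable from S_0."""
    initial = frozenset(initial)
    arcs = {}
    pending = [initial]
    while pending:
        state = pending.pop()
        if state in arcs:
            continue  # already expanded
        arcs[state] = {p: next_state(state, p, initial, ii) for p in sorted(state)}
        pending.extend(arcs[state].values())
    return arcs


diagram = build_ms_state_diagram({1, 2, 5, 6}, 7)
for latency, successor in diagram[frozenset({1, 2, 5, 6})].items():
    print(latency, sorted(successor))
# 1 [1, 5]
# 2 [6]
# 5 [1]
# 6 [2, 6]
```

For the permissible set {1, 2, 5, 6} with II = 7 the construction yields six states in total, and every arc leaving a second-level state leads to the state with the empty latency set.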

[Figure 5 placeholder: the initial state $S_0$ = {1, 2, 5, 6} has arcs labeled 1, 2, 5, and 6 leading to the states {1, 5}, {6}, {1}, and {2, 6} respectively; the arcs leaving these states (labeled 1 and 5; 6; 1; 2 and 6) all lead to the final state {}.]

Figure 5: MS-State Diagram

As an example, consider the CRT shown in Fig. 4(b) with II = 7. The initial set of permissible latencies is $S_0$ = {1, 2, 5, 6}. The MS-state diagram is shown in Fig. 5. To make things clearer, we have illustrated in Figure 6 how the MRT would look after each initiation. Note that the latencies adjacent to each arc in the figure represent the latency from the current state (or cycle). For example, consider an initiation with a latency of 1 (i.e. the transition from state $S_0$ to $S_2$). The subtraction of 1 from {1, 2, 5, 6} results in $S_{-1}$ = {0, 1, 4, 5}, the possible permissible latencies from the current state. Of these, 0 and 4, which are among the initial forbidden latencies, are not permissible since, by the definition of forbidden latency, a new initiation started 4 cycles later will cause a structural hazard with the current initiation. This can be seen by trying a new operation 4 cycles later, i.e. at time step 5, in the MRT $M_2$. For the same reason, we cannot include any other forbidden latency as a permissible latency in this state.

Note that we have shown the MRTs (in Figure 6) for the states of the MS-state diagram only for the purpose of explanation. Subsequently, the states will be represented by the set of permissible latencies in the current state. Further, note that different MRTs may correspond to a single state in the MS-state diagram. For example, the MRTs $M_5$ and $M_6$ both correspond to the final state, in which no further initiation is possible. Lastly, for the sake of clarity, Figure 6 is left incomplete.

[Figure 6 placeholder: modulo reservation tables $M_0$ through $M_6$, one per initiation sequence explored in the MS-state diagram.]

Figure 6: Modulo Reservation Tables for the states of MS-State Diagram

A path $S_0 \xrightarrow{p_1} S_1 \xrightarrow{p_2} \cdots \xrightarrow{p_k} S_k$ in the MS-state diagram corresponds to a sequence of initiations which are permissible. The latencies $p_1, p_2, \ldots, p_k$ associated with a path correspond to the latencies between successive initiations. Successive initiations are made at time steps $0, p_1, (p_1 + p_2) \bmod II, \ldots, (p_1 + p_2 + \cdots + p_k) \bmod II$; we refer to these values as the offset values from the first initiation made at time 0. For example, the path from $S_0$ to $S_2$ (latency 1) and on to the final state (latency 5) corresponds to initiations at offset values 0, (0 + 1) mod II, and (0 + 1 + 5) mod II. Henceforth, without loss of generality we always assume: (i) the first initiation is made at time 0 and (ii) offset values are specified modulo II.

The number of initiations made corresponding to a path in the MS-state diagram equals L(P) + 1, where L(P) represents the length of the path P in terms of the number of arcs. There can be several paths from $S_0$ to $S_k$. As we are interested in maximizing the number of initiations in a pipeline, we consider the longest path from the initial state $S_0$. Lastly, we say that a node S is at a distance d if the length of the (longest) path from $S_0$ is (d − 1).

Definition 3.2 The final state of an MS-state diagram is one which contains an empty permissible latency set.

The longest path from the initial state to the final state in the MS-state diagram corresponds to the maximum number of initiations (Max Init) achievable in the MS-pipeline. In general, Max Init ≤ UB Init. That is, in some MS-state diagrams the Max Init achievable may be less than UB Init. Lastly, from this longest path (to the final state) one can identify the offset values at which these initiations can be made. This information on the offset values can be used in a modulo scheduling algorithm to construct software-pipelined schedules. It was demonstrated in [9] that the above analysis can be successfully applied in modulo scheduling. Such an approach made it possible to obtain schedules with a lower initiation interval and to construct them in shorter compilation time. Next we will establish certain properties of the MS-state diagram.
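Max Init and one set of matching offset values can be obtained with a memoized longest-path search over the states generated by Procedure 3.1. The sketch below is one possible way to do this, not necessarily the method used in [9]; the function name is hypothetical. The recursion terminates because every transition strictly shrinks the latency set.

```python
from functools import lru_cache


def max_init(initial, ii):
    """Return (Max Init, offsets) for the permissible latency set S_0 and a given II."""
    initial = frozenset(initial)

    @lru_cache(maxsize=None)
    def longest(state):
        # (arcs on the longest path from `state` to the final state, its latencies)
        best = (0, ())
        for p in sorted(state):
            successor = frozenset({(q - p) % ii for q in state} & initial)
            length, latencies = longest(successor)
            if length + 1 > best[0]:
                best = (length + 1, (p,) + latencies)
        return best

    arcs, latencies = longest(initial)
    offsets = [0]                      # the first initiation is made at time 0
    for p in latencies:
        offsets.append((offsets[-1] + p) % ii)
    return arcs + 1, offsets           # initiations = L(P) + 1


print(max_init({1, 2, 5, 6}, 7))  # (3, [0, 1, 2])
print(max_init({1, 5}, 6))        # (2, [0, 1])
```

The second call uses the permissible set {1, 5} of the unmodified table at II = 6 and returns 2 initiations, one fewer than the bound $\lfloor 6/2 \rfloor = 3$, which is the gap that delay insertion is meant to close.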

3.3 Properties of MS-State Diagram

The state diagrams constructed using classical pipeline theory [12] use collision vectors to represent the set of permissible latencies in the current state. In our representation, the set of permissible latencies itself is used directly instead (see footnote 3). In addition, there is one other difference between MS-state diagrams and the state diagrams of [12]. In the construction of the state diagram in [12], the collision vector is first left-shifted by $p_i$ positions (see Section 2.1). This is done to discount possible collisions due to the resource usage (of the earlier initiation at state S) in the first $p_i$ cycles. However, in the MS-state diagram one has to consider the resource usage of the first $p_i$ cycles as well, since the earlier initiation that led to S repeats every II cycles. Hence, in the construction of the MS-state diagram, a rotate-left operation needs to be performed instead of a shift-left operation. Loosely speaking, the rotate-left operation prevents new initiations from happening at the same time steps in the modulo schedule. First we will establish the correctness of the MS-state diagram.

Theorem 3.2 The latency set associated with any state S in the MS-state diagram represents all permissible latencies in that state, taking into account all initiations made to reach the state S .

In [9] a similar representation of MS-state diagram involving collision vectors was developed. However, we prefer the new representation proposed in this paper as it is directly useful in establishing some of the properties of the MS-state diagram. 3


Proof: This theorem is proved using induction on the length of the (longest) path of state $S$.

For the initial state $S_0$, the theorem is obviously true. Assume it holds for any state having a path of length less than or equal to $k$. Let $S$ be a state at a distance $k$. Let there be an arc from $S$ to $S'$ labeled $p_i$. That is, $S'$ can be reached from $S$ with an initiation $p_i$ cycles after $S$. According to the inductive hypothesis, $p_i$ is a permissible latency in $S$. Thus, the arc from $S$ to $S'$ with a latency $p_i$ represents a valid initiation. After this initiation, corresponding to each permissible latency $p_j \in S$, $p'_j = (p_j - p_i) \bmod II$ becomes a candidate ("may be") permissible latency in the new state $S'$. The role of the mod operation in the above equation is obvious for all $p_j < p_i$. The latency $p'_j$ in $S'$ is permissible only if $p'_j$ is in the initial permissible latency set $S_0$. Thus, the intersection (with $S_0$) in Step 2 of the construction procedure of the MS-state diagram (Procedure 3.1) guarantees that $p'_j$ is a permissible latency in $S'$.

To complete the proof, we need to show that every latency that is permissible in state $S'$ is included in the permissible latency set $S'$. Assume $p$ is a permissible latency in state $S'$, but is not included in the latency set $S'$. Then either (i) $(p + p_i) \bmod II \notin S$ or (ii) $p \notin S_0$. But, since $p$ is permissible in the state $S'$, $p$ must be in $S_0$. Hence (i) must hold. The latency value $p$ in state $S'$ corresponds to $(p + p_i) \bmod II$ in state $S$. However, since $p$ is permissible in $S'$, $(p + p_i) \bmod II$ must be permissible in $S$. But, by (i), $(p + p_i) \bmod II \notin S$. Thus, there is a latency that is permissible but not included in the permissible latency set in a state at a distance $k$. This contradicts the inductive hypothesis. $\Box$
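The transition used in the proof above (rotate the current latency set by the chosen latency, then intersect with the initial set) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name `next_state` and the set-based representation are assumptions made here.

```python
def next_state(state, latency, initial, ii):
    """Successor of `state` after one more initiation `latency` cycles later.

    state   -- permissible latencies in the current state
    initial -- initial permissible latency set S_0
    ii      -- modulo initiation interval II
    """
    if latency not in state:
        raise ValueError("latency is not permissible in this state")
    # Rotate-left: re-measure every latency from the new initiation, modulo II.
    rotated = {(q - latency) % ii for q in state}
    # A rotated latency survives only if the reservation table permits it.
    return frozenset(rotated & set(initial))


S0 = frozenset({3, 6, 9})
print(sorted(next_state(S0, 3, S0, 12)))
```

Because 0 is never a permissible latency, the chosen latency itself always drops out of the successor state.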

In an MS-state diagram, as we go from the start state to the final state, the number of permissible latencies monotonically decreases, i.e., only fewer initiations are possible.

Lemma 3.2 If there is an arc from $S$ to $S'$ in the MS-state diagram, then $|S| > |S'|$, where $|S|$ represents the cardinality of the permissible latency set associated with $S$.

Proof: Let $p_i$ be the latency associated with the arc from $S$ to $S'$. From Step 2 of Procedure 3.1, and the definition of $S_{-p_i}$,

$$|S'| = |S_{-p_i} \cap S_0| \le |S_{-p_i}| = |S|$$

That is, $|S| \ge |S'|$. But we need to show strict inequality. For this, consider the latency $p_i$ in $S$. This latency translates to $p_i - p_i = 0$ in $S_{-p_i}$. Further, 0 is not a permissible latency for any single function pipeline. Thus, clearly, the latency corresponding to $p_i \in S$ does not belong to $S'$. Hence $|S| > |S'|$. $\Box$

Lemma 3.3 There are no directed cycles in the MS-State diagram.

Proof: The proof of this lemma is by contradiction. Assume that there is a directed cycle in the MS-state diagram involving $S_1, S_2, \ldots, S_k, S_1$. By Lemma 3.2,

$$|S_1| > |S_2| > \cdots > |S_k| > |S_1|$$

which is impossible. $\Box$

Lemma 3.4 Every MS-state diagram contains a final state.

Proof: The proof of this lemma follows from the fact that the cardinality of the permissible latency set associated with successive states along a directed path decreases. $\Box$

Lemma 3.5 The construction of the MS-State diagram (Procedure 3.1) terminates after a finite number of steps.

Proof: The proof of this lemma follows from Lemmas 3.2, 3.3 and 3.4.

One can verify that Lemmas 3.2 to 3.4 hold for the MS-state diagram shown in Fig. 5.
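The lemmas above can also be checked mechanically on a small example. The sketch below assumes the transition rule $S' = S_{-p_i} \cap S_0$ from Lemma 3.2 and uses the set $\{2, 3, 5, 7, 8\}$ with II = 10 as input; the function names are hypothetical and the exhaustive traversal is only meant for small latency sets.

```python
def build_ms_state_diagram(initial, ii):
    """Return {state: {latency: successor}} for every reachable state."""
    s0 = frozenset(initial)
    arcs = {}
    stack = [s0]
    while stack:
        state = stack.pop()
        if state in arcs:
            continue  # already expanded
        arcs[state] = {}
        for p in sorted(state):
            succ = frozenset({(q - p) % ii for q in state} & s0)
            arcs[state][p] = succ
            stack.append(succ)
    return arcs


def longest_path(arcs, state):
    """Number of arcs on the longest path from `state` to the final state."""
    return max((1 + longest_path(arcs, nxt) for nxt in arcs[state].values()),
               default=0)


II = 10
S0 = frozenset({2, 3, 5, 7, 8})
arcs = build_ms_state_diagram(S0, II)

# Lemma 3.2: cardinality strictly decreases along every arc.
assert all(len(s) > len(t) for s, out in arcs.items() for t in out.values())
# Lemma 3.4: the final (empty) state is reachable.
assert frozenset() in arcs

# A path with n arcs corresponds to n + 1 initiations.
max_init = longest_path(arcs, S0) + 1
print(max_init)
```

Strictly decreasing cardinality is also what makes the recursion in `longest_path` terminate (Lemmas 3.3 and 3.5).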

4 Achieving Maximum Initiations in MS-Pipelines

The construction method for the MS-state diagram described in the last section is useful in identifying Max Init, while Theorem 3.1 (in Section 3.1) gives an upper bound UB Init for the number of initiations in an MS-pipeline. The natural questions then are: (1) Under what condition(s) does Max Init equal UB Init? (2) When Max Init < UB Init, is it possible to improve Max Init? We answer the first question in this section by analyzing the MS-state diagram. As will be seen later, this analysis provides useful guidance to pipeline designers. Section 5 deals with the second question.

4.1 Cardinality of the Permissible Latency Set

First, we define a special form arithmetic progression.

Definition 4.1 An arithmetic progression (A.P.) of the form $p, 2p, \ldots, k \cdot p$ is called a special form A.P.

In a special form A.P. the difference between two successive elements is the same as the first element of the A.P. For example, the sequence $(3, 6, 9)$ is a special form A.P. Special form A.P.s help to characterize the performance of MS-pipelines as shown in the following theorem. The theorem establishes the necessary and sufficient condition for achieving the upper bound on the number of initiations governed by the cardinality of the permissible latency set. In doing so, the theorem also provides practical guidance to hardware designers in getting maximum utilization from their pipelines. In the following discussion we assume that the cardinality is less than or equal to $\lfloor II/d_{max} \rfloor$.
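As a small illustration of Definition 4.1, the check below tests whether a collection of latencies is exactly $p, 2p, \ldots, k \cdot p$ for its smallest element $p$. The helper name `is_special_form_ap` is an assumption introduced for this sketch.

```python
def is_special_form_ap(values):
    """True if the values are exactly p, 2p, ..., k*p for the smallest value p."""
    ordered = sorted(set(values))
    if not ordered or ordered[0] <= 0:
        return False
    p = ordered[0]
    return ordered == [p * i for i in range(1, len(ordered) + 1)]


print(is_special_form_ap([3, 6, 9]))
print(is_special_form_ap([3, 4, 6, 7]))
```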

Theorem 4.1 Assume an MS-pipeline with a modulo initiation interval II and the cardinality of the permissible latency set $k$. Then $(k + 1)$ initiations are possible in this pipe if and only if the initial permissible latency set forms a special form A.P.

Let us consider an example. Let $S_0 = \{3, 6, 9\}$ and II = 12. The MS-state diagram shown in Figure 7 has the longest path with path length equal to 3.

Proof: [If-part] It is given that the initial permissible latency set forms a special form A.P., say $S_0 = \{p, 2p, \ldots, k \cdot p\}$. We need to show that there exists a path of length $k$. The proof is by construction of the path $S_0 \xrightarrow{p} S_1 \xrightarrow{p} S_2 \cdots \xrightarrow{p} S_k$. First, choose $p$ as the latency from $S_0$ to reach $S_1$. One can easily verify that

$$S_1 = (S_0)_{-p} \cap S_0 = \{0, p, 2p, \ldots, (k - 1)p\} \cap S_0 = \{p, 2p, \ldots, (k - 1)p\}.$$

Again we choose $p$ as the latency to reach state $S_2$ from $S_1$. Then $S_2 = \{p, 2p, \ldots, (k - 2)p\}$. Proceeding this way, $S_{k-1} = \{p\}$ and $S_k = \{\}$. $\Box$
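The construction in the if-part can be traced directly: starting from $\{p, 2p, \ldots, k \cdot p\}$ and always choosing latency $p$, each step removes the largest element. The following sketch assumes $k \cdot p < II$ and the rotate-and-intersect transition; `ap_path` is a name invented for this illustration.

```python
def ap_path(p, k, ii):
    """States visited when latency p is chosen k times from {p, 2p, ..., k*p}."""
    if k * p >= ii:
        raise ValueError("the progression must stay below II")
    s0 = frozenset(p * i for i in range(1, k + 1))
    states = [s0]
    for _ in range(k):
        current = states[-1]
        # Rotate by p, then keep only latencies allowed by the initial set.
        states.append(frozenset({(q - p) % ii for q in current} & s0))
    return states


for state in ap_path(3, 3, 12):
    print(sorted(state))
```

The path has $k$ arcs, so it accounts for $k + 1$ initiations.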

[Figure 7 (diagram not reproduced): states {3, 6, 9}, {3, 6}, {3, 9}, {6, 9}, {3}, {6}, {9} and { }, with arcs labeled by the latencies 3, 6 and 9.]

Figure 7: MS-State Diagram for Permissible Latency Set = $\{3, 6, 9\}$

We prove the [only-if] part in the following way. First we show that when $(k + 1)$ initiations are possible, each one of these initiations starts at a different time step in the MRT; further, the offset value of each of these initiations equals a unique permissible latency. Using this result we will show that two permissible latencies $p_i$ and $p_{i+1}$ are separated by $p_1$; i.e., for all $p_i$, $p_{(i+1)} - p_i = p_1$. From this, it follows that the permissible latencies form a special form A.P. Next, we define a CL-path as follows.

Definition 4.2 A path in the MS-state diagram with path length equal to the cardinality of the (initial) permissible latency set is said to be a Cardinal Length (CL) path.

Note that an MS-state diagram may or may not have a CL-path. Theorem 4.1 states that an MS-state diagram has a CL-path if and only if the initial permissible latency set is in a special form A.P. From the if-part of Theorem 4.1, it is also possible to show that every path in the MS-state diagram whose initial permissible latency set is a special form A.P. is a CL-path; but we will skip the proof.

The following lemma shows that the offset values corresponding to any path in the MS-state diagram form a subset of the permissible latency set. Further, it shows that no two offset values are equal. As an example, consider a path of two arcs starting at $S_0$ with latencies $l_1 = 6$ and $l_2 = 2$. Here II = 7. Clearly, $l_1 = 6 \in S_0$ and

$$(l_1 + l_2) \bmod II = 8 \bmod 7 = 1 \in S_0$$

Further, $l_1 = 6$ is distinct from $(l_1 + l_2) \bmod II = 1$.

Lemma 4.1 Let $S_0 = \{p_1, p_2, \ldots, p_k\}$ and there be a path $S_0 \xrightarrow{l_1} S_1 \xrightarrow{l_2} \cdots \xrightarrow{l_r} S_r$ in the MS-state diagram. Then

$$l_1 = p_{i_1}, \quad (l_1 + l_2) \bmod II = p_{i_2}, \quad \ldots, \quad (l_1 + l_2 + \cdots + l_r) \bmod II = p_{i_r}$$

where $i_1, i_2, \ldots, i_r$ are distinct index values taken from the set $[1, k]$.

Proof: For each initiation $S_i$, the corresponding offset value is

$$O_i = (l_1 + l_2 + \cdots + l_i) \bmod II.$$

Now $O_i$ must equal some permissible latency $p_j$. Otherwise, the latency between $S_i$ and $S_0$ is forbidden, which contradicts the fact that $S_0 \xrightarrow{l_1} S_1 \xrightarrow{l_2} \cdots \xrightarrow{l_r} S_r$ is a path in the MS-state diagram. Further, the offset values $O_i$ and $O_j$ corresponding to two initiations $i$ and $j$ must be distinct; otherwise the modulo scheduling constraint is violated, since each initiation is repeated once every II cycles. Hence the Lemma. $\Box$

Corollary 4.1 If there exists a CL-path $S_0 \xrightarrow{l_1} S_1 \xrightarrow{l_2} \cdots \xrightarrow{l_k} S_k$ in the MS-state diagram, then the set containing the offset values of the above initiations is a permutation of $S_0$.

This is a special case of the previous lemma with $r = k$. $\Box$
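The offsets in Lemma 4.1 are running sums of the arc latencies taken modulo II. A short sketch (the helper name `path_offsets` is an assumption) makes the corollary easy to check for the set $\{3, 6, 9\}$ with II = 12, where the latency sequences 3, 3, 3 and 6, 9, 6 both trace paths of length 3 in the MS-state diagram.

```python
from itertools import accumulate


def path_offsets(latencies, ii):
    """Offsets O_i = (l_1 + ... + l_i) mod II along a path."""
    return [total % ii for total in accumulate(latencies)]


S0 = {3, 6, 9}
for latencies in ([3, 3, 3], [6, 9, 6]):
    offsets = path_offsets(latencies, 12)
    # Distinct offsets that together cover S_0: a permutation of S_0.
    assert len(set(offsets)) == len(offsets) and set(offsets) == S0
    print(latencies, offsets)
```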

Lemma 4.2 If $S_0 = \{p_1, p_2, \ldots, p_k\}$ is the initial permissible latency set, and there exists a CL-path in the MS-state diagram, then for any $i, j \in [1, k]$ and $i \ne j$, $(p_i - p_j) \bmod II \in S_0$.

Proof: Since there exists a CL-path in the MS-state diagram, from Corollary 4.1, the offsets of the last $k$ initiations (from the first initiation at time 0) are a permutation of the permissible latency set. That is, there is one initiation at an offset value equal to each of the permissible latencies. Thus for any two initiations, with offsets $p_i$ and $p_j$, $i \ne j$, to be valid, i.e. not to cause a collision, the latency between these two initiations must be permissible. That is, $(p_i - p_j) \bmod II$ must be in the initial permissible latency set. $\Box$

Note that the above lemma holds only for MS-state diagrams having a CL-path. For example, if $S_0 = \{2, 3, 5, 7, 8\}$ then the difference between the permissible latencies 3 and 7, i.e. $7 - 3 = 4$, does not belong to $S_0$. As the last step in the preparation for proving the [only if]-part of Theorem 4.1, we establish the following lemma.
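The closure property stated in Lemma 4.2 is straightforward to test for a given latency set. The sketch below is only an illustration with a hypothetical helper name; it assumes II = 10 for the set $\{2, 3, 5, 7, 8\}$, as in the example of Section 4.2.

```python
def closed_under_differences(s0, ii):
    """True if (a - b) mod II is in s0 for every ordered pair of distinct elements."""
    return all((a - b) % ii in s0 for a in s0 for b in s0 if a != b)


print(closed_under_differences({3, 6, 9}, 12))
print(closed_under_differences({2, 3, 5, 7, 8}, 10))
```

The first set is a special form A.P. and passes the check; the second fails because $(7 - 3) \bmod 10 = 4$ is not a member.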

Lemma 4.3 Let $S_0 = \{p_1, p_2, \ldots, p_k\}$. If $(p_j - p_i) \bmod II \in S_0$ for any $p_i \in S_0$, and for all $p_j \in S_0$ such that $p_j \ne p_i$, then $p_{(i+1)} - p_i = p_1$.

Proof: It is given that all $(p_j - p_i) \bmod II \in S_0$. In particular, $(p_{(i+1)} - p_i) \bmod II \in S_0$. Since the elements of $S_0$ are in the ascending order, the elements $(p_{(i+1)} - p_i)$, $(p_{(i+2)} - p_i)$, $\ldots$, $(p_k - p_i)$ are also in the ascending order. The modulo operation with II is omitted here since, for all $j \ge i$, $0 \le (p_j - p_i) \le II$. Further, since each $p_j \in S_0$ is unique, the above elements are also unique. We know that $p_k < II$. Subtracting $p_i$ from both sides we have,

$$p_k - p_i < II - p_i$$

But from Lemma 3.1, $p_{(k-i+1)} + p_i = II$. Thus substituting $p_{(k-i+1)}$ for $II - p_i$, we have,

$$p_k - p_i < p_{(k-i+1)}$$

Now, since $p_k - p_i$ is a permissible latency, and $p_{(k-i)} < p_{(k-i+1)}$,

$$p_k - p_i \le p_{(k-i)}$$

Since $p_{(k-1)} < p_k$, using $p_{(k-1)}$ in the place of $p_k$ we get,

$$p_{(k-1)} - p_i < p_{(k-i)}.$$

As $p_{(k-1)} - p_i$ is a permissible latency, and $p_{(k-i-1)} < p_{(k-i)}$, we get

$$p_{(k-1)} - p_i \le p_{(k-i-1)}$$

Proceeding further this way,

$$p_{(i+1)} - p_i \le p_{(i+1-i)}, \quad \text{i.e.} \quad p_{(i+1)} - p_i \le p_1$$

But $p_1$ is the smallest permissible latency in $S_0$ and therefore $p_{(i+1)} - p_i = p_1$. $\Box$

Now we are ready to prove the [only-if]-part of Theorem 4.1.

Proof: [Only-if part of Theorem 4.1] Since there are $k + 1$ initiations possible, there must exist a CL-path (of length $k$) in the MS-state diagram. Then by Lemma 4.2, the difference (modulo II) between every pair of elements in $S_0$ is also in $S_0$. Now choosing $p_i$ to be one of $p_1, p_2, \ldots, p_{k-1}$ and applying Lemma 4.3, we get

$$p_2 - p_1 = p_1; \quad p_3 - p_2 = p_1; \quad \ldots; \quad p_k - p_{(k-1)} = p_1,$$

which completes the proof. $\Box$

4.2 A Sufficient Condition

Theorem 4.1 states that in order to achieve UB Init, when it is governed by the first bound in Theorem 3.1, the initial permissible latency set must be a special form A.P. This is a strong result; however its applicability is limited as it requires all permissible latencies to be in a special form A.P. In this subsection we present a weaker, but more useful, result which requires only a subset of the initial permissible latencies to be a special form A.P. Having such a subset is a sufficient condition for achieving a given number of initiations. Consider the initial permissible latency set $S_0 = \{3, 4, 6, 7\}$ and II = 10. $S_0$ contains the subsequence $\{3, 6\}$, of cardinality 2, which forms a special form A.P. Hence there exists a path $S_0 \xrightarrow{3} S_1 \xrightarrow{3} S_2$ which corresponds to at least 2 + 1 = 3 initiations. The following theorem formalizes this idea.

Theorem 4.2 If the initial permissible latency set $S_0$ contains a special form arithmetic progression of length $l$, then at least $(l + 1)$ initiations are possible in the MS-pipe.

Proof: The proof of this theorem is similar to the proof of the [if-part] of Theorem 4.1. By considering only those permissible latencies which form the special form A.P. it is possible to construct a path of length $l$ in the MS-state diagram. Hence the Theorem. $\Box$

Theorem 4.2 provides only a sufficient but not necessary condition. The following example illustrates this. Assume II = 10, $d_{max} = 2$ and the initial permissible latency set $S_0 = \{2, 3, 5, 7, 8\}$. It can be seen that there exists a path

$$S_0 \xrightarrow{3} S_1 \xrightarrow{2} S_2 \xrightarrow{3} S_3$$

in the MS-state diagram and thus Max Init = 4. However the longest special form A.P. contained in the permissible latency set is of length 1. (Trivially every element forms a special form A.P. sequence of one element.)
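The gap between the bound of Theorem 4.2 and the true Max Init can be reproduced on the two examples of this subsection. The sketch below is an illustration under the rotate-and-intersect transition rule; `longest_special_ap` and `max_init` are hypothetical helper names, and the exhaustive search in `max_init` is only practical for small sets.

```python
def longest_special_ap(s0):
    """Length of the longest p, 2p, ..., l*p contained in s0."""
    best = 0
    for p in s0:
        if p <= 0:
            continue
        length = 0
        while p * (length + 1) in s0:
            length += 1
        best = max(best, length)
    return best


def max_init(s0, ii):
    """Max Init: one more than the longest path in the MS-state diagram."""
    s0 = frozenset(s0)

    def longest(state):
        return max((1 + longest(frozenset({(q - p) % ii for q in state} & s0))
                    for p in state), default=0)

    return longest(s0) + 1


for s0 in ({3, 4, 6, 7}, {2, 3, 5, 7, 8}):
    bound = longest_special_ap(s0) + 1   # guaranteed by Theorem 4.2
    print(sorted(s0), bound, max_init(s0, 10))
```

For $\{3, 4, 6, 7\}$ the bound is tight, while for $\{2, 3, 5, 7, 8\}$ the theorem guarantees only 2 initiations although 4 are achievable.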

5 Delay Insertion in MS-Pipelines

In this section we address the questions relating to improving Max Init towards the upper bound. We adapt the delay insertion method of Patel and Davidson to MS-pipelines to improve Max Init. We show that, with the delay insertion method, it is always possible to realize the upper bound due to $d_{max}$, namely $\lfloor II/d_{max} \rfloor$. It is important to note here that the cardinality of the permissible latency set no longer plays a role in determining UB Init. This is because the permissible latency set, and hence its cardinality, changes with delay insertion.

5.1 Delay Insertion to Improve Number of Initiations

In our motivating example, we saw that at most two initiations can be made for the reservation table shown in Fig. 1 for an II of 6; i.e., Max Init = 2. However, UB Init for that reservation table is 3. The question is: can we somehow modify the reservation table to accommodate 3 initiations in the same pipe in 6 cycles? It was also seen that the modified reservation table shown in Fig. 3(a) can, indeed, realize a Max Init of 3. We develop a systematic approach to this problem in this section.

By Theorem 4.2, $(m + 1)$ initiations can be guaranteed if the permissible latency set $S_0$ contains a special form A.P. of length $m$. Thus if we can adjust the permissible latency set to contain an $m$-length special form A.P. then we are done. It must be noted that the existence of a special form A.P. is only a sufficient but not necessary condition. Thus it may be possible to achieve $(m + 1)$ initiations without $S_0$ containing an $m$-length special form A.P. For example, the MS-pipeline discussed in Section 4.2 with an initial permissible latency set $S_0 = \{2, 3, 5, 7, 8\}$ supports 4 initiations, though its permissible latency set contains only a special form A.P. of size 1. In this paper we do not consider such alternative possibilities.

Theorem 4.2 suggests that if the permissible latency set contains a special form A.P. of length 2, then 3 initiations can be guaranteed. Let the A.P. be $\{p, 2p\}$. Clearly, $2 \cdot p < II$; i.e.

$$p < \lfloor II/2 \rfloor = \lfloor 6/2 \rfloor = 3$$

In other words, we need to have an A.P. $\{p, 2p\}$ in the permissible latency set for any value of $p$ in the range 1 to 2. For example, when $p = 1$, let us try to include the A.P. $\{1, 2\}$ in the permissible latency set. Let $P_i = \{1, 2\}$ denote the A.P. If the permissible latency set (denoted by $P$) has to include the elements 1 and 2 then it will also include their complements $II - 1 = 5$ and $II - 2 = 4$. Thus $P = \{1, 2, 4, 5\}$ is a permissible latency set that will guarantee at least 3 initiations. From this, the forbidden latency set $F$ is $\{0, 3\}$. Thus, we need to modify the reservation table such that 0 and 3 are the only forbidden latencies. Patel and Davidson's method suggests that any row of the modified reservation table should have X marks in columns represented by the compatibility classes or derived compatibility classes of the forbidden latency set. We restate the definitions of compatibility and derived compatibility classes for the sake of easy reference.

Definition 5.1 Two elements $f_1$ and $f_2$ of a set $F$ are compatible if $|f_1 - f_2|$ is in $F$.

It can be seen that if $|f_1 - f_2|$ is in $F$, then $(f_1 - f_2) \bmod II$ is also in $F$.

Definition 5.2 A compatibility class with respect to $F$ is a set in which all pairs of elements are compatible.

An algorithm to compute all the compatibility classes is given in [12] (page 99).
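Definitions 5.1 and 5.2 can be illustrated with a brute-force enumeration of maximal compatibility classes. This is not the algorithm of [12]; it is an exponential-time sketch suitable only for small forbidden latency sets, and the function names and the sample set $\{0, 3, 4, 5, 6, 7\}$ are choices made for this illustration.

```python
from itertools import combinations


def compatible(f1, f2, forbidden):
    """Definition 5.1: f1 and f2 are compatible if |f1 - f2| is in F."""
    return abs(f1 - f2) in forbidden


def is_compatibility_class(candidate, forbidden):
    """Definition 5.2: every pair of elements is compatible."""
    return all(compatible(a, b, forbidden) for a, b in combinations(candidate, 2))


def maximal_compatibility_classes(forbidden):
    """All compatibility classes not contained in a larger one (brute force)."""
    elements = sorted(forbidden)
    classes = []
    # Try larger subsets first so that subsets of a found class are skipped.
    for size in range(len(elements), 0, -1):
        for subset in combinations(elements, size):
            cand = frozenset(subset)
            if is_compatibility_class(cand, forbidden) and \
                    not any(cand < found for found in classes):
                classes.append(cand)
    return classes


for cls in maximal_compatibility_classes({0, 3, 4, 5, 6, 7}):
    print(sorted(cls))
```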

Definition 5.3 If $C = \{c_1, c_2, \ldots, c_n\}$ is a compatibility class, then $C' = \{(c_1 + I) \bmod II, (c_2 + I) \bmod II, \ldots, (c_n + I) \bmod II\}$ is a derived compatibility class for any integer $I$.

As an example, if $F = \{0, 3, 4, 5, 6, 7\}$, then $\{0, 3, 6\}$, $\{0, 4, 7\}$ and $\{0, 5\}$ are compatibility classes and $\{1, 4, 7\}$ is a derived compatibility class. The forbidden latency set $\{0, 3\}$ is itself a compatibility class. Thus, if we choose the compatibility/derived compatibility classes $\{0, 3\}$ and $\{1, 4\}$ to represent the first two rows of the CRT respectively, we obtain the modified reservation table shown in Figure 3(a).

The insertion of delays in the reservation table may increase the execution time of an operation (see footnote 4); this in turn may increase RecMII, which may affect II. Thus the above method of modifying the reservation table is applicable only when the introduction of delays does not affect II. This is possible either (1) when the loop does not contain recurrence cycles involving operations that are executed in this pipeline; or (2) when ResMII dominates RecMII. Procedure 5.1 formally describes this delay insertion method to achieve a given number of initiations.

Footnote 4: An increase in the execution time will also increase the live ranges of variables, which in turn may increase the register pressure. We do not consider such effects in this paper.

Procedure 5.1: Delay Insertion to Achieve (m + 1) Initiations:

Step 1 Derive the MS-State diagram as described in Procedure 3.1. If the length of the longest path $l$ equals $m$, then the given CRT supports $(m + 1)$ initiations; go to Step 4.

Step 2 For each $p$ from 1 to $(\lfloor II/m \rfloor - 1)$ do

Step 2.1 Let $\{p, 2p, \ldots, mp\}$ be a subset of permissible latencies.

Step 2.2 The permissible latency set is $P = \{p, 2p, \ldots, mp, II - p, \ldots, II - mp\}$.

Step 2.3 Compute the forbidden latency set $F$: $F = \{0, 1, 2, \ldots, II - 1\} - P$.

Step 2.4 Derive all the compatibility classes of $F$ using the procedure described in [12] (page 99).

Step 2.5 If the size of the largest compatibility class is less than $d_{max}$, the maximum number of X marks in any row, try the next value of $p$; go to Step 2.1.

Step 2.6 For each row of the CRT choose an appropriate compatibility or derived compatibility class. Modify the CRT to match the chosen compatibility class. Go to Step 4.

Step 3 Report failure to support $(m + 1)$ initiations. Exit.

Step 4 Report the modified CRT that supports $P$. End.

One remaining question is: Does our method (Procedure 5.1) always succeed? In other words, is it always possible to support $(m + 1) \le \lfloor II/d_{max} \rfloor$ initiations in a CRT? We answer this question affirmatively in the following section.
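Steps 2.1 to 2.3 of Procedure 5.1 amount to simple set arithmetic, sketched below for the motivating example (II = 6, $m = 2$). The function name `candidate_latency_sets` is an assumption of this sketch, and the remaining steps (compatibility classes and CRT modification) are not covered here.

```python
def candidate_latency_sets(p, m, ii):
    """Steps 2.1-2.3: permissible set P and forbidden set F for a given p."""
    if m * p >= ii:
        raise ValueError("m * p must be smaller than II")
    ap = {p * i for i in range(1, m + 1)}           # Step 2.1
    permissible = ap | {ii - q for q in ap}         # Step 2.2: add complements
    forbidden = set(range(ii)) - permissible        # Step 2.3
    return permissible, forbidden


II, m = 6, 2
# Step 2 iterates p from 1 to floor(II / m) - 1.
for p in range(1, II // m):
    P, F = candidate_latency_sets(p, m, II)
    print(p, sorted(P), sorted(F))
```

For $p = 1$ this reproduces $P = \{1, 2, 4, 5\}$ and $F = \{0, 3\}$ from the discussion preceding the procedure.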

5.2 Achieving the Upper Bound on Maximum Initiations

We establish that UB Init can always be achieved by introducing sufficient delays in the CRT.

Theorem 5.1 Given an initiation interval II and a CRT with at most $d_{max}$ X marks on each row, it is always possible to achieve $u = \lfloor II/d_{max} \rfloor$ initiations by introducing delays in the CRT.

Proof: We prove this theorem by constructing a modified CRT that supports the set $\{1, 2, \ldots, (u - 1)\}$ as a (sub)set of permissible latencies. We will only prove the theorem for the row(s) of the CRT that consist of $d_{max}$ X marks. Other rows can also be modified in a similar way to support the same permissible latency (sub)set. Place the X marks at cycles

$$0, \; u, \; 2 \cdot u, \; \ldots, \; (d_{max} - 1) \cdot u.$$

Thus the forbidden latency set contains

$$F' = \{0, u, 2 \cdot u, \ldots, (d_{max} - 1) \cdot u\}.$$

It can be easily seen that the distance between any pair of X marks is already in the above subset of forbidden latencies $F'$. But, by the definition of forbidden latency, if $f$ is forbidden, then $II - f$ is also forbidden. Hence the following subset $F''$ is also contained in the forbidden latency set.

$$F'' = \{(II - u), (II - 2 \cdot u), \ldots, (II - (d_{max} - 1) \cdot u)\}.$$

Thus, the complete forbidden latency set for this CRT is $F = F' \cup F''$. Note that the terms in $F'$ are in the ascending order while those in $F''$ are in the descending order. Hence, if we can prove that the last term in $F''$ is no smaller than $u$, the first nonzero term in $F'$, then all latencies in $[1, u - 1]$ are permissible. Since this forms a special form A.P. of length $(u - 1)$, at least $u = \lfloor II/d_{max} \rfloor$ initiations are guaranteed by Theorem 4.2. Thus, to complete the proof, we need to show that

$$II - (d_{max} - 1) \cdot u \ge u.$$

Consider

$$\text{L.H.S} = II - (d_{max} - 1) \cdot \left\lfloor \frac{II}{d_{max}} \right\rfloor \ge II - (d_{max} - 1) \cdot \frac{II}{d_{max}} = \frac{II}{d_{max}}$$

Introducing the floor function on the R.H.S,

$$\text{L.H.S} \ge \left\lfloor \frac{II}{d_{max}} \right\rfloor = u \qquad \Box$$
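The construction in the proof can be checked numerically. The sketch below places $d_{max}$ X marks $u$ cycles apart in one row, derives the forbidden latencies as pairwise differences of mark positions modulo II (an assumption consistent with the complement property used in the proof), and confirms that $1, \ldots, u - 1$ remain permissible. The name `spaced_row` is hypothetical.

```python
def spaced_row(ii, d_max):
    """X-mark positions spaced u = floor(II / d_max) apart, and their forbidden set."""
    u = ii // d_max
    marks = [i * u for i in range(d_max)]
    # Two initiations collide on this row when their distance (mod II)
    # equals the distance between two X marks.
    forbidden = {(a - b) % ii for a in marks for b in marks}
    return u, marks, forbidden


# Exhaustive check for small initiation intervals.
for ii in range(2, 65):
    for d_max in range(1, ii + 1):
        u, marks, forbidden = spaced_row(ii, d_max)
        assert all(lat not in forbidden for lat in range(1, u))

print(spaced_row(10, 3))
```

For II = 10 and $d_{max} = 3$ the marks sit at cycles 0, 3 and 6, and latencies 1 and 2 stay permissible, which by Theorem 4.2 guarantees $u = 3$ initiations.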

Theorem 5.1 shows that by placing the X marks $u = \lfloor II/d_{max} \rfloor$ cycles apart, we can always achieve $u$ initiations. This also ensures that Procedure 5.1 will always succeed, executing the iteration in Step 2 only once. That is, it will always be able to construct the CRT supporting the permissible latencies $(1, 2, \ldots, u - 1)$. In Section 7 we present an experimental study that evaluates the usefulness of MS-pipeline theory and the delay insertion method. In the following section a software pipelining method that makes use of MS-pipeline theory is described.

6 Application of MS-Pipeline Theory in Software Pipelining

Here we briefly describe how the proposed MS-pipeline theory can be applied to a software pipelining algorithm. In this discussion we will concentrate on how to achieve the modified reservation table that matches well for the current loop. This implies that the resource usage patterns of different pipelines are specifically tuned for each loop. Such an approach is more appropriate for high-level synthesis of application-specific systems. However, modifying the resource usage patterns at runtime in modern processors is quite involved and is currently not supported. A simple but effective way to achieve the effect of modifying the resource usage pattern is to support multiple versions for each instruction class in the instruction set architecture. Our work on MS-pipeline theory is directly useful here and helps the microarchitecture designer to identify which reservation tables are best suited for various values of II. Experiments similar to those discussed in the following section (Section 7) can be conducted by the microarchitecture designer in deciding the different versions of an instruction class.

In this discussion we present a simple software pipelining method that chooses the most appropriate version of different instruction classes so as to efficiently schedule the given loop. The software pipelining method is based on Huff's slack scheduling algorithm [10]. We adapt this method to (i) make use of the MS-state diagram in deciding on which time steps an operation can be initiated, and (ii) carefully choose the appropriate version for each instruction class that best fits the given loop. An algorithmic description of our method is given in Procedure 6.1.

To make the discussion concise and address only the relevant points, we assume that the architecture consists of only one pipeline in each FU type (or equivalently each instruction class). However, let us assume that multiple versions are supported in each instruction class. For each instruction class and for each version of the instruction, construct the MS-state diagram for the given II. Alternatively, this can be precomputed and stored in a database and made available to the compiler. For each instruction class, choose the version that supports a Max Init greater than or equal to the number of operations that are executed in that particular FU type. If there are multiple versions that satisfy this condition, choose the one that satisfies the condition on a maximum number of paths (in the MS-state diagram). Since each path corresponds to a set of offset values at which initiations can be made, choosing the version that has a larger number of paths increases the chances of successfully scheduling the operations for the given II. For the chosen version of the instruction class, select the paths in the MS-state diagram that support at least as many initiations as are required for the given loop. The software pipelining algorithm will make use of the set of offset values corresponding to these paths. Different instruction classes, such as FP Add, FP Multiply, and FP Divide, have different resource usages and the specific version chosen in each case is decided independently.

The basic step of the software pipelining algorithm is to schedule instructions according to some priority. Many of the software pipelining algorithms use slack (the difference between the earliest and latest times at which an instruction may be scheduled) as a measure for priority. While the original software pipelining algorithm attempts to schedule an instruction any time in its slack range, we suggest the use of sets of offset values (corresponding to different paths in the MS-state diagram) to guide the search. In particular, the schedule of an instruction is attempted only at offset values that lie within the slack range. As we proceed to schedule different instructions of an instruction class in the pipe, we proceed to choose a specific path (among the set of paths that were selected initially). It is important that the offset values of the chosen path should enable the instructions to be scheduled within their slack time. Though we have presented only a simplistic view of the algorithm, it involves backtracking in the selection of an appropriate path for an instruction class.

Further, the aspects of scheduling, unscheduling, and rescheduling of instructions are also present in our software pipelining method, as they are in the original method, whenever an instruction cannot be scheduled in its slack range. The only difference is that our approach takes a global view of when instructions should be scheduled and which version of the instruction class is to be used for the given II. These enable obtaining an efficient software pipelined schedule. Further, as our approach places instructions only at predetermined offset values (corresponding to a path in the MS-state diagram) that guarantee legal initiations (i.e. no collision), our approach deals with pipeline hazards in a much more efficient way than the original software pipelining methods.

Procedure 6.1: A Software Pipelining Method that uses MS-Pipeline theory

Step 1 For each instruction class $I$

Step 1.1 For each version $v$ of the instruction class

Step 1.1.1 Construct the MS-state diagram;

Step 1.1.2 Determine the set of paths $P(v, I)$ (and the corresponding offset values) that support at least $N(I)$ instructions, where $N(I)$ represents the number of instructions in the given loop that are executed in this FU type.

Step 1.2 Choose the version $V$ that has the maximum number of paths in $P(v, I)$, i.e. $L(P(v, I))$ is maximum.

Step 1.3 If $P(V, I)$ is empty, i.e. none of the versions of the instruction class can support $N(I)$ instructions, then increase II by 1; go back to Step 1.

Step 2 While there exists an unscheduled instruction, repeat Steps 2.1 to 2.5.

Step 2.1 If the total number of ejected instructions (of all instruction classes) exceeds some threshold value (say THRESHOLD ON TOTAL EJECTED OPS), then increase II by 1 and go back to Step 1.

Step 2.2 Compute the slack and priority of the unscheduled instructions. Choose the instruction with the highest priority.

Step 2.3 Attempt to schedule the instruction at a time step in its slack range. The chosen time step must correspond to an offset value supported by at least one of the paths in $P(V, I)$. Exclude those paths from $P(V, I)$ that do not support this offset value.

Step 2.4 If there are no paths in $P(V, I)$ that support any of the offset values in the slack range, unschedule the last instruction of this instruction class; this somewhat corresponds to backtracking on the MS-state diagram to a previous level. Therefore the unscheduling of a scheduled instruction increases the number of available paths in $P(V, I)$. Go back to Step 2.3 (i.e. attempt to schedule the current instruction). This will eventually succeed, because when the algorithm backtracks to the root of the MS-state diagram, it should be possible to schedule the instruction, as it is the first one initiated in this pipe.

Step 2.5 If the number of ejections of instructions belonging to this instruction class exceeds a certain threshold value (say THRESHOLD ON EJECTED OPS PER INS CLASS), unschedule all instructions of this instruction class. Choose a different version $V'$ for this instruction class and go back to Step 2.1.

Step 3 End.
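Step 1 of Procedure 6.1 can be sketched as follows. The sketch assumes that each version of an instruction class is described only by its initial permissible latency set, that the first initiation is placed at offset 0, and that paths are compared by their offset sequences; the version names, the helper names and the sample latency sets are all hypothetical. Slack computation and the scheduling loop of Step 2 are not modeled.

```python
def paths_supporting(initial, ii, n_ops):
    """Offset sequences of all MS-state diagram paths that place n_ops initiations."""
    s0 = frozenset(initial)
    found = []

    def extend(state, offsets):
        if len(offsets) == n_ops:
            found.append(tuple(offsets))
            return
        for p in sorted(state):
            succ = frozenset({(q - p) % ii for q in state} & s0)
            extend(succ, offsets + [(offsets[-1] + p) % ii])

    extend(s0, [0])  # the first initiation sits at offset 0
    return found


def choose_version(versions, ii, n_ops):
    """Step 1.2: pick the version with the most qualifying paths (None if no version qualifies)."""
    best_name, best_paths = None, []
    for name, initial in versions.items():
        paths = paths_supporting(initial, ii, n_ops)
        if len(paths) > len(best_paths):
            best_name, best_paths = name, paths
    return best_name, best_paths


# Hypothetical versions of one instruction class, for II = 10 and 4 operations.
versions = {"v1": {3, 4, 6, 7}, "v2": {2, 3, 5, 7, 8}}
name, paths = choose_version(versions, 10, 4)
print(name, len(paths))
```

When `choose_version` returns `None`, Step 1.3 applies: no version supports the required number of operations, and II would have to be increased.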

The above software pipelining algorithm and the proposed MS-pipeline theory can be extended to, and in fact are more appropriate for, loops with conditionals. For example, under trace scheduling or superblocks [5, 11] it may be more beneficial to use different versions of instruction classes for the frequently taken and not-taken paths of the trace. Techniques discussed in [23, 24] can be incorporated in our software pipelining method to handle conditionals. The details of these extensions are beyond the scope of this paper and we plan to investigate them in future.

7 Experimental Results 1. How useful is Theorem 4.2 in obtaining latency sequences that result in Max Init ? 2. How often does Max Init achieve the maximum utilization, i.e., equal UB Init ? 3. In cases where Max Init does not reach UB Init , can the delay insertion method developed in Section 5 be used to achieve higher utilization? If so, how many delay cycles may be needed? In an attempt to answer the above questions, we conducted some initial experiments. We report their results in this section. Our experiments concentrate on analyzing MS-pipelines and demonstrating their usefulness. It also present some quantitative results on the delay insertion method. Though the results of our analysis can be used in a software pipelining algorithm (e.g. the one discussed in the previous section) we do not focus on these application aspects in our experiments. First, we brie y describe our experimental setup. We considered several reservation tables which model the resource usage of di erent instructions (or instruction classes) of 30

real processors such as MIPS R-4000, MIPS R-8000 or DEC Alpha 21064. For each of these pipelines we generated the MS-state diagram for di erent II values, ranging from 8 to 64. The MS-state diagram consists of a large number (greater than 10,000) of states for large values of II. Therefore, for pragmatic reasons, we restricted the size of the MSstate diagram to a maximum of 10,000 distinct states. Our algorithm also computes the longest path (corresponding to Max Init ). When Max Init is less than UB Init , our delay insertion method (Procedure 5.1) is applied for each value of initiation between Max Init and UB Init . For each of these values, we compute how many delay cycles need to be inserted. In introducing delays, we have implemented a more realistic pipeline model in which delaying one stage of a pipeline at some time t delays all subsequent resource usages (in all stages). We collect the following statistics from our experimental setup. For the 10 reservation tables and the 56 di erent initiation intervals considered, a total of 496 combinations resulted . In all these cases we measured the above parameters. The results are tabulated in Table 1. We observe that only in 128 (26%) of the 496 test cases, Max Init equals the UB Init . Further, on the average, Max Init corresponds only to 69% (median) or 76% (arithmetic mean) of UB Init . In the remaining 74% of the cases where Max Init doesn't reach UB Init , only 70% utilization of the MS-pipelines is achieved. However, in these cases, we nd that by introducing a small delay, often 1 cycle, the upper bound can be reached in more than 57% of the cases. In other cases, with a delay of 1 cycle, upto 90% of UB Init can be realized. Thus a signi cant performance improvement can be expected with recon gurable pipelines. 5

The average value (arithmetic mean) of the delay introduced to achieve UB Init is 9.5 cycles; the median is 1 cycle. In order to compare the delays introduced for different IIs, we define a metric called the delay ratio, which is the ratio of delay cycles to II. The mean and median values of the delay ratio required to achieve UB Init are 0.26 and 0.06, respectively.

Further, we noticed that the use of Theorem 4.2 was quite effective. In 318 out of 496 test cases, the theorem obtained a latency sequence that results in Max Init. On average, in the 496 test cases considered, determining the longest special form of arithmetic progression (A.P.) covers 91% of Max Init. This result, reported in the last row of Table 1, shows that even though an MS-state diagram may involve several thousands of states, determining the longest special form A.P. reveals a path in the state diagram that supports, on average, about 90% of Max Init. This implies that the state diagram construction time can be saved in many cases if the loop consists of fewer (than 90% of Max Init) operations to be initiated in a given pipe.

Description                                           No. of Cases   %-age
Cases when Max Init equals UB Init                    128            25.8
MS-state diagram consists of more than 10000 nodes    258            52.0
Theorem 4.2 yielded a Max Init sequence               318            64.1

Description                                           Arith. Mean    Median
Number of states in MS-state diagram                  5664           -
Ratio of Max Init to UB Init                          0.76           0.69
No. of delays introduced to achieve UB Init           9.5            1
Delay ratio to achieve UB Init                        0.26           0.06
Ratio of Max Init obtained from Theorem 4.2           0.91           1.00

Table 1: Statistics of MS-State Diagram and Performance of Delay Insertion Method

Footnote 5: We did not include cases where either dmax is greater than II or UB Init equals 1. In the former case the modulo scheduling constraint is violated, while in the latter UB Init is trivially achieved.
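As a small worked check of the arithmetic behind Table 1, the sketch below recomputes the reported percentages from the case counts and shows how the delay ratio (delay cycles divided by II) is formed. The counts and the total of 496 come from the text; the `sample` list of (delay, II) pairs is entirely hypothetical and is included only to show how a few large delays can pull the mean far above the median, as the reported figures suggest.

```python
from statistics import mean, median

TOTAL_CASES = 496  # number of (reservation table, II) combinations in the text

# Case counts as reported in Table 1.
table1_counts = {
    "Max Init equals UB Init": 128,
    "MS-state diagram exceeds 10000 nodes": 258,
    "Theorem 4.2 yielded a Max Init sequence": 318,
}


def percentage(count, total=TOTAL_CASES):
    """Share of test cases, in percent, rounded to one decimal place."""
    return round(100 * count / total, 1)


def delay_ratio(delay_cycles, ii):
    """Delay ratio as defined in the text: delay cycles divided by II."""
    return delay_cycles / ii


# Hypothetical (delay cycles, II) pairs, NOT measurements from the paper.
sample = [(1, 16), (1, 8), (1, 32), (2, 20), (40, 64)]
delays = [d for d, _ in sample]
ratios = [delay_ratio(d, ii) for d, ii in sample]

for label, count in table1_counts.items():
    print(f"{label}: {percentage(count)}%")
print("delay mean/median:", mean(delays), median(delays))
print("ratio mean/median:", round(mean(ratios), 2), round(median(ratios), 2))
```

Reporting both the mean and the median, as Table 1 does, matters for skewed data of this kind: one outlier dominates the mean while the median stays at the typical value.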

8 Related Work

Classical pipeline theory was developed almost two decades ago and was successfully employed to improve the throughput and the utilization of pipelined and vector architectures [12, 17]. Such approaches select a latency cycle having the minimum average latency, and schedule operations in the pipeline repetitively based on the period of the latency cycle. Patel and Davidson proposed delay insertion to improve the performance of hardware pipelines [17]. Their approach introduces delays in the reservation table to support latency cycles that achieve the optimal minimum average latency. An important point to observe here is that the period of the latency cycle is determined solely by the pipeline structure. We refer to this kind of scheduling as cyclic scheduling of pipelines. The theory of cyclic scheduling of pipelines has been applied to cyclic job-shop scheduling in manufacturing systems [2].

Recently, ideas from hardware pipeline theory have been used to develop an FSA-based approach to instruction scheduling for pipelined architectures involving structural hazards [16, 18, 1]. These methods effectively reduce the problem of checking structural hazards to a fast table lookup, yielding a significant speedup in scheduling time. The method has been applied in production compilers for MIPS processors [18], KSR (Kendall Square Research), and DEC Alpha 21064 [1]. However, these instruction scheduling methods are for straight-line code and do not consider software pipelining. Further, these FSA-based methods follow the greedy approach, attempting to schedule an operation in the current cycle if it does not cause a structural hazard. In contrast, the approach in this paper and the work reported in [9] consider all latency sequences and choose the one that maximizes the throughput of the MS-pipeline.

In contrast to cyclic scheduling of pipelines [17, 12, 2], the initiation interval (II) of Modulo-Scheduled Pipelines depends on the recurrences in the loop and the resource availability. The theory of MS-pipelines has direct application in software pipelining, as established in our Co-Scheduling framework [9]. A salient feature of the MS-pipeline theory is that it integrates scheduling constraints posed by the pipeline structure and the modulo initiation interval. To the best of our knowledge, this is the first attempt to tune the classical pipeline theory into a form suitable for software pipelining [20, 13, 10, 19, 3, 21].

The proposed theory of MS-pipelines and the methods developed in this paper to improve their performance are directly useful, as discussed in Section 6, to any modulo scheduling algorithm. In that sense, our work complements the existing work on software pipelining [20, 13, 10, 19, 3, 21]. More recent software pipelining methods [10, 4, 14] also concentrate on minimizing register pressure. The software pipelining method discussed in Section 6, like Huff's slack scheduling method, can be lifetime-sensitive, i.e., it can attempt to place an operation closer to either its Estart (earliest start) or Lstart (latest start) time. Lastly, as discussed in Section 6, MS-pipeline theory and the proposed software pipelining method are applicable to loops with conditionals. They can make use of the methods discussed in [23, 24] to software pipeline loops having conditionals.

Our work complements the FSA-based methods [16, 18, 1], in that we focus on software pipelining while their methods are applicable to general instruction scheduling. Our MS-state diagram considers all possible initiation sequences and can choose the one that gives the maximum resource utilization. Lastly, the theory developed in this paper is helpful in further improving the performance of MS-pipelines through the introduction of delays.
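The contrast drawn above between greedy hazard checking and considering all initiation sequences can be illustrated with a toy example. This is a simplified sketch under stated assumptions, not the FSA-based schedulers of [16, 18, 1] nor the MS-state diagram itself: the function names and the single-stage reservation table are hypothetical, and both strategies work directly on initiation offsets within one II window.

```python
def forbidden_latencies(table, ii):
    """Latencies (mod ii) at which two initiations collide in some stage."""
    return {(c1 - c2) % ii
            for cycles in table.values()
            for c1 in cycles for c2 in cycles if c1 != c2}


def greedy_initiations(forbidden, ii):
    """Initiate in every cycle that causes no hazard with earlier initiations."""
    chosen = []
    for t in range(ii):
        if all((t - c) % ii not in forbidden for c in chosen):
            chosen.append(t)
    return chosen


def best_initiations(forbidden, ii):
    """Exhaustively search for a largest hazard-free set of offsets."""
    best = []

    def extend(chosen, start):
        nonlocal best
        if len(chosen) > len(best):
            best = list(chosen)
        for t in range(start, ii):
            if all((t - c) % ii not in forbidden for c in chosen):
                extend(chosen + [t], t + 1)

    extend([0], 1)  # fix the first initiation at offset 0 (rotational symmetry)
    return best


# Hypothetical table: one stage used at cycles 0 and 3 of each operation.
table = {"S1": [0, 3]}
II = 8
forbidden = forbidden_latencies(table, II)
print("greedy:    ", greedy_initiations(forbidden, II))
print("exhaustive:", best_initiations(forbidden, II))
```

In this example the greedy choice packs initiations into offsets 0, 1, and 2 and then blocks every later offset, whereas the exhaustive search finds four initiations spaced two cycles apart.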

9 Conclusions

In this paper, we have investigated the theory of MS-pipelines. In particular, we have proposed a new representation for the MS-state diagram which can be analyzed to choose latency sequences that improve the number of initiations in the MS-pipeline. We have established a sufficient condition, on the permissible latency set, to achieve a given number of initiations in an MS-pipeline. It was also established that, under some restricted cases, in order to achieve the maximum number of initiations in an MS-pipeline, the permissible latency set must be an arithmetic progression of a special form. This is both a necessary and sufficient condition. It also provides practical guidance to hardware designers in getting maximum utilization from their pipelines.

Using the sufficiency condition on the permissible latency set, we have developed a method, taking strong hints from [17, 12], that introduces delays in the MS-pipeline to support a given number of initiations in the pipe. Using the delay insertion method, we establish that the maximum number of initiations can always be achieved. It was shown how the theory of MS-pipelines developed in this paper can be used in any software pipelining method.

Lastly, this paper reported quantitative results based on experiments conducted on 10 reservation tables representing the resource usage of different instruction classes in modern RISC processors. Our experimental results show that with the insertion of a small delay, often 1 cycle, up to 90% of UB Init can be realized. This is further evidence of the usefulness of the MS-pipeline theory to microarchitecture designers.

References

[1] V. Bala and N. Rubin. Efficient instruction scheduling using finite state automata. In Proc. of the 28th Ann. Intl. Symp. on Microarchitecture, pages 46-56, Ann Arbor, MI, 1995.

[2] J. K. Chaar and E. S. Davidson. Cyclic job shop scheduling using collision vectors. Technical Report CSE-TR-169-93, University of Michigan, Ann Arbor, MI, Aug. 1993.

[3] J. C. Dehnert and R. A. Towle. Compiling for the Cydra 5. J. of Supercomputing, 7:181-227, May 1993.

[4] A. E. Eichenberger and E. S. Davidson. Stage scheduling: A technique to reduce the register requirements of a modulo schedule. In Proc. of the 28th Ann. Intl. Symp. on Microarchitecture, pages 338-349, Ann Arbor, MI, Nov. 29-Dec. 1, 1995.

[5] J. A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Trans. on Computers, C-30(7):478-490, Jul. 1981.

[6] P. B. Gibbons and S. S. Muchnick. Efficient instruction scheduling for a pipelined architecture. In Proceedings of the SIGPLAN '86 Symposium on Compiler Construction, pages 11-16, Palo Alto, CA, June 25-27, 1986.

[7] J. R. Goodman and W-C. Hsu. Code scheduling and register allocation in large basic blocks. In Conference Proceedings, 1988 International Conference on Supercomputing, pages 442-452, St. Malo, France, July 4-8, 1988.

[8] R. Govindarajan, E. R. Altman, and G. R. Gao. Co-scheduling hardware and software pipelines. ACAPS Technical Memo 92, School of Computer Science, McGill University, Montreal, Quebec, Jan. 1995.

[9] R. Govindarajan, E. R. Altman, and G. R. Gao. Co-scheduling hardware and software pipelines. In Proc. of the Second Intl. Symp. on High-Performance Computer Architecture, pages 52-61, San Jose, CA, Feb. 3-7, 1996.

[10] R. A. Huff. Lifetime-sensitive modulo scheduling. In Proc. of the ACM SIGPLAN '93 Conf. on Programming Language Design and Implementation, pages 258-267, Albuquerque, NM, Jun. 23-25, 1993.

[11] W. M. Hwu et al. The superblock: An effective technique for VLIW and superscalar compilation. J. of Supercomputing, 7:229-248, Jan. 1993.

[12] P. M. Kogge. The Architecture of Pipelined Computers. McGraw-Hill Book Co., New York, NY, 1981.

[13] M. Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proc. of the SIGPLAN '88 Conf. on Programming Language Design and Implementation, pages 318-328, Atlanta, Georgia, Jun. 22-24, 1988.

[14] J. Llosa, M. Valero, E. Ayguade, and A. Gonzalez. Hypernode reduction modulo scheduling. In Proc. of the 28th Ann. Intl. Symp. on Microarchitecture, pages 350-360, Ann Arbor, MI, 1995.

[15] S-M. Moon and K. Ebcioglu. An efficient resource-constrained global scheduling technique for superscalar and VLIW processors. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pages 55-71, Portland, OR, December 1-4, 1992.

[16] T. Muller. Employing finite state automata for resource scheduling. In Proc. of the 26th Ann. Intl. Symp. on Microarchitecture, Austin, TX, Dec. 1-3, 1993.

[17] J. H. Patel and E. S. Davidson. Improving the throughput of a pipeline by insertion of delays. In Proc. of the 3rd Ann. Symp. on Computer Architecture, pages 159-164, Clearwater, FL, Jan. 19-21, 1976.

[18] T. A. Proebsting and C. W. Fraser. Detecting pipeline structural hazards quickly. In Conf. Rec. of the 21st ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages, pages 280-286, Portland, OR, Jan. 17-21, 1994.

[19] B. R. Rau and J. A. Fisher. Instruction-level parallel processing: History, overview and perspective. J. of Supercomputing, 7:9-50, May 1993.

[20] B. R. Rau and C. D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proc. of the 14th Ann. Microprogramming Work., pages 183-198, Chatham, MA, Oct. 12-15, 1981.

[21] B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proc. of the 27th Ann. Intl. Symp. on Microarchitecture, pages 63-74, San Jose, CA, 1994.

[22] B. R. Rau, M. S. Schlansker, and P. P. Tirumalai. Code generation schema for modulo scheduled loops. In Proc. of the 25th Ann. Intl. Symp. on Microarchitecture, pages 158-169, Portland, OR, Dec. 1-4, 1992.

[23] N. J. Warter, G. E. Haab, J. W. Bockhaus, and K. Subramanian. Enhanced modulo scheduling for loops with conditional branches. In Proc. of the 25th Ann. Intl. Symp. on Microarchitecture, pages 170-179, Portland, OR, Dec. 1-4, 1992.

[24] N. J. Warter, S. A. Mahlke, W. W. Hwu, and B. R. Rau. Reverse if-conversion. In Proc. of the ACM SIGPLAN '93 Conf. on Programming Language Design and Implementation, pages 290-299, Albuquerque, NM, Jun. 23-25, 1993.
