control-flow intensive designs are dominated by nested conditional constructs, ... Speculative execution is a technique that has been used to overcome, to some extent, the ..... is selected for scheduling in a given state (statement 2). .... The following analysis presents a structured means of identifying such relationships.
Incorporating Speculative Execution into Scheduling of Control-flow Intensive Behavioral Descriptions Ganesh Lakshminarayana †, Anand Raghunathan ‡, and Niraj K. Jha † † Dept. of Electrical Engineering, Princeton University, Princeton, NJ-08544 ‡ NEC USA, C&C Research Laboratories, Princeton, NJ-08540
Abstract Speculative execution refers to the execution of parts of a computation before the execution of the conditional operations that decide whether they need to be executed. It has been shown to be a promising technique for eliminating performance bottlenecks imposed by control flow in hardware and software implementations alike. In this paper, we present techniques to incorporate speculative execution in a fine-grained manner into scheduling of control-flow intensive behavioral descriptions. We demonstrate that failing to take into account information such as resource constraints and branch probabilities can lead to significantly sub-optimal performance. We also demonstrate that it may be necessary to speculate simultaneously along multiple paths, subject to resource constraints, in order to minimize the delay overheads incurred when prediction errors occur. Experimental results on several benchmarks show that our speculative scheduling algorithm can result in significant (upto seven-fold) improvements in performance (measured in terms of the average number of clock cycles) as compared to scheduling without speculative execution. Also, the best and worst case execution times for the speculatively performed schedules are the same as or better than the corresponding values for the schedules obtained without speculative execution.
1 Introduction Scheduling is one of the most important tasks in high-level synthesis. It determines the cycle-by-cycle behavior of a design by assigning parts of the computation to be performed to particular clock cycles (control steps, or control states). Scheduling affects the performance (number of clock cycles required to perform the required computation, as well as the clock period of the implementation), power, area, and testability of the design either directly or in conjunction with other high-level synthesis tasks. The evolution of high-level design techniques has often been driven by specific application domains, such as the data-flow or arithmetic intensive domain, that includes digital signal and image processing, graphics, and several multimedia applications, and the control-flow or decision intensive application domain, that includes networking/telecommunication protocols, embedded controllers, etc. The behavioral descriptions of data-flow intensive designs are dominated by arithmetic operations such as addition, subtraction, and multiplication, while those of control-flow intensive designs are dominated by nested conditional constructs, data-dependent loops, and comparisons, with very few arithmetic operations. The area, delay, and power of structural RTL implementations of dataflow intensive designs are dominated by arithmetic units and registers in the data path, while in the case of controlflow intensive designs, they are dominated by non-arithmetic units like multiplexers, bit-manipulation units, and comparators. In practice, a large number of designs tend to contain significant amounts of control flow as well as data flow. For such designs, the control-flow constructs often impose bottlenecks on the performance achievable using hardware and software implementations alike [1, 2, 3]. Speculative execution is a technique that has been used to overcome, to some extent, the performance bottlenecks imposed by control flow. Speculative execution refers to the execution of a part of a computation before it is known that the control path to which it belongs will be executed (for example, execution of the code after a branch statement before the branch condition itself is evaluated). There has been previous work on speculative execution in the areas of high-level synthesis [3, 4, 5] as well as high-performance compilation [6, 7, 8, 9, 10, 11]. This paper presents techniques to integrate speculative execution into scheduling during high-level synthesis of control-flow intensive designs. In that context, we demonstrate that not using information such as resource constraints and branch probabilities while deciding when to speculate can lead to significantly sub-optimal performance. We also demonstrate that it is necessary to perform speculative execution along multiple paths at a fine-grain level, in order to obtain maximal benefits. In addition, we present techniques to automatically manage the additional speculative results that are generated by speculatively executed operations. We show how to incorporate speculative execution into a generic scheduling methodology, and in particular present its integration into Wavesched [12]. Experimental results for various benchmarks and examples are presented that indicate upto seven-fold improvement in performance (average number of clock cycles required to perform the computation).
2 Background and related work Scheduling tools typically work using one or more intermediate representations of the behavioral description, such as a data flow graph (DFG), control flow graph (CFG), or control-data flow graph (CDFG). In this paper, we use the CDFG as the intermediate representation of a behavioral description, and state transition graphs (STGs) to represent the scheduled behavioral description, as explained in later sections. In addition to the behavioral description, our scheduler also accepts the following information: A constraint on the number of resources of each type available (resource allocation constraints). The target clock period for the implementation, or constraints that limit the extent of data and control chaining allowed. 1
Profiling information that indicates the branch probabilities for the various conditional constructs present in the behavioral description.
Most scheduling techniques can be classified into two broad categories – resource -constrained and time-constrained. Resource-constrained scheduling techniques assume that the set of resources (functional units and/or registers) that will be used to implement the design is specified, and attempt to minimize the number of clock cycles required to perform the computation. Time-constrained scheduling techniques assume that a fixed number of clock cycles are available to perform the computation, and attempt to minimize the number of resources required. Scheduling techniques and tools for data-flow intensive designs are primarily concerned with exploiting the tradeoff between parallelism (or concurrency) and resource requirements [13]. Among the simplest scheduling techniques are as soon as possible (ASAP) and as late as possible (ALAP) scheduling, where operations are scheduled as soon as or as late as allowed by the data and control dependencies. List scheduling is a resource-constrained scheduling technique, where a list of operations that are ready to be scheduled, sorted based on some notion of priority or urgency, is maintained. Scheduling proceeds one control step at a time, and the operation at the head of the list is assigned to the current control step if a resource that can execute it is available. Various heuristics have been proposed to associate a priority function to operations in the candidate list [13]. Force-directed scheduling [14] is a global, time-constrained scheduling technique that assigns operations to control steps based on a quantity called the force, which is computed using probability distribution graphs of resource utilization. In time-constrained iterativeimprovement based scheduling [15], a simple initial schedule (such as ASAP/ALAP) is refined by applying refinements iteratively, or as sequences of moves, in order to improve its resource requirements. For control-flow intensive designs, scheduling techniques have focussed on exploiting the mutual exclusion of operations in the description that is imposed by conditional constructs. For example, operations of the same type (e.g., addition, subtraction, etc.) that are mutually exclusive may be scheduled in the same clock cycle without requiring a separate functional unit to implement each of them. Similarly, mutually exclusive paths of computation may be scheduled independently, optimizing each path differently. Some of the above described techniques, such as list and force-directed scheduling, have been extended and applied to control-flow intensive designs. In [14], extensions to force-directed scheduling are proposed that allow information about mutual exclusion of operations to be considered when computing resource utilization during scheduling. A list scheduling based technique using condition vectors is described in [4]. Apart from accounting for mutual exclusion while computing resource utilization, the scheduling technique also includes features such as operation and conditional node replication, parallelization of multiple conditional trees, and scheduling of operations independent of control dependencies. In path-based scheduling [16], each control path is scheduled independently, as fast as possible, and the individual schedules are combined into a single schedule while attempting to minimize the number of controller states. Enhancements to path-based scheduling, including re-ordering operations within basic blocks to exploit data flow information, and control partitioning to avoid explosion in the number of paths are presented in [17]. Symbolic scheduling [18] attempts to represent all possible solutions to the scheduling problem, i.e., all feasible schedules, using ordered binary decision diagrams (OBDDs). While the quality of solutions is optimal or close to optimal, the size of the representation may become unmanageable for large designs. A significant feature of control-flow intensive designs is the presence of nested data-dependent conditional loops (e.g., while loops). The scheduling techniques presented above do not perform efficient optimization of loops. For example, most of them handle a data-dependent loop by considering the acyclic loop body alone. Thus, techniques
2
analogous to ones such as loop unrolling and winding, which have been shown to be highly efficient in optimizing data-flow intensive designs, are not employed. Loop-directed scheduling [19] overcomes the above deficiency by optimizing loops by dynamically performing conditional loop unrolling, when necessary, during scheduling. This is shown to result in a significant reduction in the average number of clock cycles required to execute behavioral specifications that contain data-dependent loops. However, loop-directed scheduling operates on a CFG and does not exploit parallelism. Wavesched is a scheduling technique for control-flow intensive designs that exploits both parallelism and mutual exclusion, and incorporates comprehensive data-dependent loop optimizations, including dynamic loop unrolling and winding, and parallel execution of data-dependent loops [12]. A few attempts have been made to incorporate speculative execution into high-level synthesis. The scheduling algorithm presented in [4] is capable of scheduling operations independent of control dependencies, resulting in speculative execution. However, the algorithm considers acyclic CDFGs, and is hence not well-suited for optimizing data-dependent loops. Speculative execution was shown to have significant performance benefits in coprocessor synthesis in [3], and was combined with loop pipelining in [5]. However, the techniques presented there are based on performing speculative execution along a single path (the most probable path) only, and executing the less probable paths only after it has been confirmed that a prediction error has occurred. This is a coarse-grain approach to incorporating speculative execution into high-level synthesis, which, as shown in the following section, may lead to significantly sub-optimal solutions. Moreover, we demonstrate that it is important to use information about resource constraints and branch probabilities during the process of fine-grain speculative execution. Our algorithm automatically incorporates speculative execution based on the above information, so that maximal performance improvements are attained. An approach to performing register synthesis for storing the intermediate results generated by speculatively executed operations is described in [20]. It is shown that a shift-register structure can be used to store the different versions of speculatively generated variables, leading to an area-efficient implementation. This technique is applicable to register synthesis after the scheduling step, and can be used in conjunction with our scheduling technique as well. Speculative execution has also been used in other domains such as high-performance processor design and parallelizing compiler design. In microprocessors, branch prediction is a commonly used hardware-supported form of speculative execution. Most modern high-performance processors provide support for speculative execution in various ways [21, 22, 23]. Recognizing that control flow imposes significant limitations on the achievable instruction level parallelism, several compiler techniques have been proposed that incorporate speculative execution [6, 7, 8, 9, 10, 11].
3 Incorporating speculative execution during scheduling In this section, we present some motivational examples to illustrate the use of speculative execution during scheduling. Example 1: Consider a part of a behavioral description and the corresponding CDFG fragment shown in Figure 1, that contains a while loop. The CDFG contains vertices corresponding to operations of the behavioral description, where solid lines indicate data dependencies, and dotted lines indicate control dependencies. Control edges in the CDFG are annotated with a variable that represents the result of the conditional operation that generates them. For example, the control edges fed by operation > 1 are marked c in Figure 1. The initial values of variables i and t4 used in the loop body are indicated in parentheses beside the corresponding CDFG data edges. Let us now consider the task of scheduling the CDFG shown in Figure 1. Suppose we have the following constraints to be used during scheduling. 3
k (0)
>1 ++1
c c
i (0)
M1
C1
t1
... i := 0; t4 := 0; while (k > t4) { i := i + 1; t1 := M1[i]; t2 := t1 * C1; t3 := t2 * C2; t4 := t3 + C3 M2[i] := t4; } ...
*1
C2
t2
*2
C3
t3 +1 t4 M2
Figure 1: Example CDFG used to illustrate speculative execution
The target clock period allows the execution of +, ++, >, and memory access operations in one clock cycle, while the ∗ operation requires two clock cycles. In addition, we assume that the ∗ operation will be implemented using a 2-stage pipelined multiplier. No operation chaining is allowed, since it leads to a violation of the target clock period constraint (in general, however, our algorithm can handle chaining). The aim is to optimize the performance of the design as much as possible. Hence, no resource constraints are specified for the purposes of illustration for this example. This is not a limitation of our scheduling algorithm, which does handle resource constraints as described in later sections.
A schedule for the CDFG that does not incorporate speculative execution is shown in Figure 2(a). This schedule can be obtained by applying either the loop-directed scheduling [19] technique or the Wavesched [12] technique on the CDFG. Vertices in the STG represent schedule states, that directly correspond to states in the controller of the RTL implementation. Each state is annotated with the names of the CDFG operations that are performed in that state, including a suffix that represents a symbolic iteration index of the CDFG loop that the operation belongs to. For example, consider operation > 1 of the CDFG. When > 1 is encountered the first time during scheduling, it is assigned a subscript 0, resulting in operation > 1 0 in the STG of Figure 2(a). In general, multiple copies of an operation may be generated during scheduling, corresponding to different conditional paths, or different iterations of a loop. Operation > 1 1 in the STG of Figure 2(a) corresponds to the execution of the first unrolled instance of CDFG operation > 1. We employ the general indexing scheme presented in [12] to keep track of various instances of a CDFG operation that are generated during scheduling. Hence, the indexing technique is not explained here in detail. An edge in the STG represents a controller state transition, and is annotated with the conditions that activate the transition. These conditions are logic expressions consisting of results of conditional operations that are executed in the source state.
4
S0
>1_0, ++1_0/c_0 c_0
S1 ++1_1/ c_1, M1_0 c_0
++1_2/ (c_1
c_2),
S2
S0
M1_1/c_1, *1_0
>1_0 c_0
S1
++1_0
S2
M1_0
S3
*1_0
S4
*1_0
S5
*2_0
S6
*2_0
S3 ++1_3, M1_2, *1_0, *1_1
S9 S4
++1_4, M1_3, *1_1, *1_2, *2_0
STOP
c_0
++1_5, M1_4, *1_2, *1_3,
c_1
S5
c_1
*2_0, *2_1
++1_6, M1_5, *1_3, *1_4, S6
S7
S8
+1_0
c_2
*2_1, *2_2, +1_0
M2_0, >1_1 S8
S7 >1_1, ++1_7, M1_6, *1_4, *1_5,
c_1 S9
STOP
*2_2, *2_3, +1_1, M2_0
(a)
c_1
>1_2, ++1_8, M1_7, *1_5, *1_6,
c_2 *2_3, *2_4, +1_2, M2_1
(b)
Figure 2: (a) Non-speculative schedule, and (b) Schedule incorporating speculative execution for the CDFG of Figure 1 Each iteration of the loop in the scheduled CDFG requires eight clock cycles. For this example, the data dependencies among the operations within the loop require them to be performed serially. In addition, the control dependencies between the comparison operation > 1 and operations + + 1 and M1, together with the inter-iteration data dependency 1 from +1 to > 1, prevent the parallel computation of multiple loop iterations, even when loop unrolling is employed. In other words, the cycle in the CDFG that involves operations > 1, M1, ∗1, ∗2, and +1 poses a bottleneck, much like the “recursive bottleneck” for data-flow intensive designs [24, 25], that limits the performance of the schedule. Use of speculative execution allows us to ignore the control dependencies, and hence break the above bottleneck, as explained next. 1 An intra-iteration data or control dependency is between operations that correspond to the same iteration of a loop, while an interiteration dependency is between operations in different (e.g., consecutive) iterations. We refer to intra-iteration data and control dependencies simply as data and control dependencies.
5
A schedule for the CDFG of Figure 1 that incorporates speculative execution is shown in the STG of Figure 2(b). This schedule was derived by techniques we present in later sections. Speculatively executed operations are annotated with the conditional operations whose results they depend upon, using the following notation. op/cond represents an operation op that is executed assuming that the speculation condition cond will evaluate to true. The speculation condition cond could, in general, be an expression that is a conjunction of the results of various conditional operations in the STG. For example, consider operation + + 1 1/c 1 in state S1 of Figure 2(b). This is a speculatively executed operation, that corresponds to the second instance of CDFG operation + + 1 in the schedule, and assumes that the result of conditional operation > 1 1, which is executed only in state S7, is going to be true. Operation + + 1 2/(c 1 ∧ c 2) in state S2 of Figure 2(b) represents the third instance of + + 1, and the condition speculated upon is c 1 ∧ c 2 2 . The results generated by such speculatively executed operations are called speculative results, and may or may not be used depending on the evaluation of conditional operations during later clock cycles or control states. The speculation condition of a speculative result is defined to be the speculation condition of the operation that generates it. When a conditional operation c is executed, our scheduler automatically generates code to resolve all speculative results whose speculation conditions involve c. States S7 and S8 in Figure 2(b) represent the steady-state iteration of the while loop in the CDFG of Figure 1 (the sequence of states traversed corresponding to an input sequence that causes the while loop to be executed a large number of times is . . . , S7, S8, S7, S8, S7, S8, . . .). Note that state S7 has exactly one instance of each CDFG operation. Speculatively executed operations in states S3, . . . , S8 are not annotated with the conditions they are speculated upon for the sake of clarity. However, it is easy to see that the speculation condition for an operation instance op i is
Vi
k, where c j represents the earliest unexecuted instance of the conditional operation > 1. c ( j ? 1), . . . , c 0 are not used since the corresponding operations have been executed in earlier states, and their results have already been k= j c
used to determine the current state, as well as to resolve the variables used as operands in the current state.
S7
M2_0
>1_1
+1_1
*2_2
*2_3
*1_4
*1_5
M1_6
++1_7
S8
M2_1
>1_2
+1_2
*2_3
*2_4
*1_5
*1_6
M1_7
++1_8
S7
M2_2
>1_3
+1_3
*2_4
*2_5
*1_6
*1_7
M1_8
++1_9
S8
M2_3
>1_4
+1_4
*2_5
*2_6
*1_7
*1_8
M1_9
++1_10
S7
M2_4
>1_5
+1_5
*2_6
*2_7
*1_8
*1_9
M1_10
++1_11 ++1_11
Figure 3: Steady-state operation of the speculative schedule of Figure 2(b) 2∧
represents the Boolean AND operation.
6
In order to illustrate the operation of the STG of Figure 2(b) in states S7 and S8, consider an input test case for which the while loop in the CDFG is executed a large number of times. Figure 3 shows an unrolled version of the STG for five consecutive cycles starting from the first time state S7 is executed. The figure illustrates the “iteration threads” that connect operations that belong to the same iteration of the while loop. It can be seen that a new iteration of the while loop is speculatively initiated in each clock cycle. Thus, when the while loop is executed a large number of times, the average number of clock cycles required for each iteration is close to one. In effect, the use of speculative execution has eliminated the control dependency bottlenecks, allowing us to pipeline the while loop in the CDFG for a throughput of one clock cycle. In general, the behavioral description may contain arbitrarily nested data-dependent loops and conditionals, resulting in significantly more complex speculative schedules than the one shown in Figure 2(b). Hence, it is important to have efficient techniques to automatically incorporate speculative execution during scheduling. From the previous example, it can be seen that the tasks that need to be performed by a scheduler in order to incorporate speculative execution include the following:
Automatically decide which operations to speculatively execute based on the CDFG structure, resource constraints, and CDFG profiling statistics (branch probabilities). This includes the objective of ensuring that the performance penalty incurred when prediction errors occur is not excessive. Automatically generate and manage new intermediate variables to store results of speculatively executed operations. Generate code in the schedule to dynamically resolve speculative results automatically when the conditional operations that they depend on have been evaluated.
All the above tasks are automatically performed by our scheduling technique that is described in later sections. The benefit of incorporating speculative execution into the scheduling process (as opposed to applying it as a preprocessing step to scheduling) is that detailed information that is available during scheduling, such as resource constraints, branch probabilities, etc., can be factored in when making decisions involving speculation. The importance of using such information to support speculative execution is demonstrated using the following examples. Example 2: Consider the example CDFG shown in Figure 4. The select operation Sel1 selects the data operand at a
++1
21 b
>1
(c1)
c c
d
(c1) +1
+2 3
>>1 l s
r
e
Sel1
*1 out
Figure 4: Example CDFG used to demonstrate the effect of resource constraints and branch probabilities on speculative execution
7
its l (r) port if the value at its s port is 1 (0). Figure 5 shows three different schedules that use speculative execution, that were generated using different resource constraints and branch probabilities. The STG of Figure 5(a) was generated assuming the following information: ++1, +1 / c1 S0
S0
++1, +2 / c1
++1, +1 / c1, +2 / c1 S0
>1,*1 / c1, +1 / c1 S1
>>1 / c1, >1, *1 / c1 S1
c1
c1
*1
>>1 S2
*1
>>1 / c1, >1, +2 / c1 S1
S2
S2
*1
c1 STOP S3
S3
c1
STOP S3
STOP S4
(a)
(b)
(c)
Figure 5: Three speculative schedules derived using different resource constraints or branch probabilities
Resource constraints of one incrementer (++), one adder(+), one comparator (>), one shifter (>>), and one multiplier (∗). All units require one clock cycle, and no chaining is allowed. The probability of comparison > 1 evaluating to f alse is higher than it evaluating to true.
Since the result of > 1 evaluates to f alse more often, the schedule of Figure 5(a) gives preference to executing operations from the corresponding control path (e.g., +2). As a result, +2 is scheduled to be performed on the sole adder in state S0, as opposed to +1, even though the data operands for both operations are available. The average number of clock cycles, CCa , required for the STG in Figure 5(a) can be calculated as follows. CCa = 4.P(c1) + 2.(1 ? P(c1)) = 2.P(c1) + 2
(1)
In the above equation, P(c1) represents the probability that the result of comparison > 1 evaluates to true. The STG of Figure 5(b) was derived with the same information as above, except that it was assumed that comparison > 1 evaluates to true more often than it evaluates to f alse. Hence, operation +1 is given preference over operation +2 and is scheduled in S0. The average number of clock cycles, CCb, required for the STG in Figure 5(b) is given by the following expression. CCb = 3.P(c1) + 3.(1 ? P(c1)) = 3
(2)
Suppose the resource constraints were relaxed to allow two adders. The speculative schedule that results is shown in Figure 5(c). The average number of clock cycles, CCc , required for the STG in Figure 5(c) is given by the following expression. CCc = 3.P(c1) + 2.(1 ? P(c1)) = P(c1) + 2
(3)
The values of CCa , CCb , and CCc for various values of P ranging from 0 to 1 are plotted in Figure 6. As expected, the schedule of Figure 5(a) outperforms the schedule of Figure 5(b) when P(c1) < 0.5, and the schedule of Figure 5(b) performs better when P(c1) > 0.5. Moreover, the schedule of Figure 5(c), which was derived using one 8
4.0 CCa
Expected Number of Cycles
3.5
CCb
3.0
CCc
2.5 2.0 1.5 1.0 0.5 0.0
0
0.1
0.2
0.3
0.4 0.5 0.6 Probability (P)
0.7
0.8
0.9
1
Figure 6: Comparison of the speculative schedules shown in Figure 5 extra adder, outperforms the other two schedules for all values of P(c1). Thus, we can conclude that branch probabilities and resource constraints do influence the trade-offs involved in deciding which conditional paths to speculate upon, making the case for the integration of speculative execution into the scheduling step where such information is available. The following example illustrates that it is necessary to perform speculative execution along multiple paths, in a fine-grained manner, in order to obtain maximal performance improvements. Example 3: The schedules shown in Figure 5 were all generated by speculatively executing operations from both ++1, +1 / c1 S0
>>1 / c1, >1 S1 c1
c1
1 S2
*
+2
S3
*1
S4
STOP S5
Figure 7: Speculation along a single path the conditional paths of the CDFG in a fine-grained manner, as allowed by the resource constraints. For the purpose of comparison, we scheduled the CDFG shown in Figure 4, assuming the same scheduling information that was assumed to derive the schedule of Figure 5(b). However, in this case, we restricted the scheduler to allow speculative execution along only one path. The resulting schedule is shown in Figure 7. The average number of clock cycles, CCd , required for the STG in Figure 7 is given by the following expression. CCd = 3.P(c1) + 4.(1 ? P(c1)) = 4 ? P(c1)
9
(4)
Comparing the expression for CCd to the expression for CCb from the previous example indicates that CCd ≥ CCb for all feasible values of P(c1). Thus, in this example, simultaneously speculating along multiple paths according to resource availability results in a schedule that is provably better than one derived by speculating along only the most probable path. Our scheduling algorithm automatically decides the best paths to speculate upon for the given resource constraints and branch probabilities.
4 The Algorithm In this section, we detail the working of our algorithm with illustrative examples. We first present the changes which need to made to a generic scheduling algorithm to support speculative execution. This is done in three parts. The first part describes a generic scheduler. The second and third parts provide, respectively, an overview and a detailed description of the modifications that need to be made to the generic scheduler to support speculative execution. We then briefly describe a particular scheduling algorithm. The concepts, formulated in the context of a generic scheduler, are then applied to a particular scheduling algorithm. 4.1 A generic scheduling algorithm
Generic scheduler (CDFG G, A LLOCATION CONSTRAINT C, M ODULE SELECTION INFO M inf , CLOCK PERIOD clk) f S ET Unscheduled operations; S ET Schedulable operations; 1 while (jUnscheduled operationsj > 0) f // while unscheduled operations remain 2 op = Select schedulable operation (Schedulable operations, C, M inf , clk); //Select an operation for scheduling. The selected operation must honor // allocation and clock cycle constraints 3 Schedule(op); 4 Unscheduled operations.remove operation(op); 5 Schedulable operations.remove operation(op); 6 S ET schedulable successors = Compute schedulable successors(op); //Find the set of operations in the fanout of op which become schedulable when op is scheduled 7 Schedulable operations.append(schedulable successors); //Augment Schedulable operations by addition of operations in schedulable successors
g
g
Figure 8: Pseudocode for a generic scheduling algorithm Figure 8 shows the pseudocode for a generic scheduling algorithm. The inputs to the scheduler are a CDFG, G, to be scheduled, the target clock period of the design, allocation constraints, which specify the numbers and types of functional units available, and module selection information, which gives the type of functional unit an operation is mapped to. The output of the scheduler is an STG which describes the schedule. At any point, a generic scheduler maintains (a) the set of unscheduled operations whose data and control dependencies have been satisfied, and can therefore be scheduled (Schedulable operations), and (b) the set of operations which are unscheduled (Unscheduled operations). The scheduling process proceeds as follows: an operation from Schedulable operations is selected for scheduling in a given state (statement 2). The selection should honor allocation and clock cycle constraints. The manner in which the selection is done varies from one scheduling algorithm to another. The selected operation, op, is scheduled in the state. Since op no longer belongs to either Schedulable operations or 10
Unscheduled operations, it is removed from these sets (statements 4 and 5). Also, the scheduling of op might render some of the operations in its fanout schedulable. The routine Compute schedulable successors (statement 6) identifies such operations, and these operations are subsequently included in the set Schedulable operations (statement 7). 4.2 Incorporating speculative execution into a generic scheduler: An overview We now provide an overview of the changes that need to be made to incorporate speculative execution into the framework of the generic scheduler shown in Figure 8. op1
op0
c(op1) c(op1) op2
op3 l s
r Sel1
op4
Figure 9: A CDFG fragment illustrating speculative execution To support speculative execution, the generic scheduler shown in Figure 8 needs to be modified as follows. 1. When an operation is scheduled, one needs to recognize all its schedulable successors, including the ones which can be speculatively scheduled. In addition, speculatively executed operations and their successors need to be specially marked. Clearly, procedure Compute schedulable successors needs to be augmented to consider such cases. This is done by means of a node-tagging scheme, detailed in Section 4.3. Example 4: Consider the CDFG fragment shown in Figure 9. We assume that operation op0 is scheduled, operation op2 has just been scheduled, and operations op1, op3, Sel1, and op4 are unscheduled. The output of the routine Compute schedulable successors(op2) must include operation op4, which can now be speculatively executed, i.e., its operands can be assumed to be the results of operations op2 and op0. 2. When operations are scheduled, control and data dependencies of speculatively executed operations are resolved. This would potentially validate or invalidate speculatively performed operations. Operations which are validated should be considered “normal”, i.e., they need not be specially marked any longer. Operations in Unscheduled operations and Schedulable operations which are invalidated need no longer be considered for scheduling. They can, therefore, be removed from these sets. In general, the resolution of the control or data dependencies of a speculatively performed operation creates two separate threads of execution, which correspond to the success and failure of the speculation. The procedure for handling validations and invalidations of operations is detailed in Section 4.3. Example 5: Consider again, the CDFG fragment shown in Figure 9. Suppose operations op0, op2 and op4 have been scheduled, and operation op3 is unscheduled. Operation op4 uses as its operands, the results of operations op2 and op0. Assume that operation op1 has just been scheduled. If op1 evaluates to true, then the execution of op4 can be considered fruitful, because the operands chosen for its computation are correct. Therefore, op4, and its scheduled and schedulable successors need not be considered conditional on the result of op1 anymore, and the data structures can be modified to reflect this fact. If, however, op1 evaluates to false, then op4 should use as its operands, the results of operations op3 and op0, thus invalidating the result of our speculation. Therefore, schedulable operations, whose computations are influenced by the result computed by op4 are invalid, and can be removed from the set Schedulable operations.
11
3. The set, Schedulable operations, from which an operation is selected for scheduling, contains operations whose execution is speculative, i.e., whose results are not always useful. The selection procedure, represented by the routine Select schedulable operation() (statement 2), needs to be modified to account for this fact. For example, operations, whose execution is extremely improbable, would make poor selection candidates, as the resources consumed by them might be better utilized by operations whose execution is more probable. Also, operations, which fall on critical paths, would be better candidates for selection than those on off-critical paths. A selection procedure which addresses these problems is considered in Section 4.3. 4.3 Incorporating speculative execution into a generic scheduler: A closer look In this section, we fill in the details of the changes outlined in Section 4.2. This is preceded by a formal treatment of concepts related to speculative execution. A scheduler which supports speculative execution works with conditioned operations as its atomic schedulable units, just as a normal scheduler uses operations. Therefore, the fanin-fanout relationships between operations, captured by the CDFG, need to be defined for conditioned operations. Since all speculatively performed operations are conditioned on some event, the adjective “speculatively performed” when applied to an operation, implies that it is conditioned on some event or combination of events. As mentioned in Section 4.2, when an operation is scheduled, its schedulable successors need to be computed. In general, it is possible to schedule the successors of a speculatively scheduled operation, as illustrated in Example 6.
op1
c(op1)
op4
c(op4)
c(op1)
op2
r
l s
c(op4)
op5
op3
op6
r
l
Sel1
s
Sel2
op7
Figure 10: Illustrating the scheduling of successors of speculatively performed operations Example 6: Consider the CDFG fragment shown in Figure 10. Assume that operations op5 and op6 have been scheduled, operations op1, op3, and op4 are unscheduled, and op2 has just been scheduled. It is now possible 0
to schedule two versions of operation op7, with the first version, op7 , using op2 and op5 as its operands, and the second, op7 using op2 and op6. op7 is conditioned on c(op1) ∧c(op4), and op7 is conditioned on c(op1) ∧c(op4). 00
0
00
The following analysis presents a structured means of identifying such relationships. We now present a result which helps derive fanin-fanout relationships among speculatively performed operations. Lemma 1: Consider an operation, op, whose fanins are op1, op2, . . ., opn. If the fanins of op have been speculatively scheduled, so can op. In particular, if the ith fanin, opi, is conditioned on Ci , then op would be conditioned on
Vn
i=1 Ci .
We now present details of Steps 1, 2, and 3, outlined in Section 4.2. Step 1: This step addresses the issue of deriving all schedulable successors of a scheduled operation, op. The result of Lemma 1 is used for this procedure. 12
Observation 1 Every set, S = fop, op1, . . . , opig of scheduled operations, which satisfies the following condition sources a schedulable operation. Condition: There exits an operation, fanout, in the CDFG, all of whose fanins are reachable from the outputs of the operations in S through paths which consist exclusively (if at all) of select operations. The path connecting the output of an operation opj, in S, to an input of fanout is denoted by Pj, and the operations on Pj are Selj1, Selj2, . . ., Seljaj. Note that aj can equal 0. C j represents the condition that path Pj is selected, i.e., the result of operation opj is V propagated through path Pj to the appropriate input of fanout. Operation fanout is conditioned on ik=1(C(opk) ∧Ck ) where C(opk) represents the expression opk is conditioned on. Observation 1 can be used to infer the schedulable successors of an operation. The procedure Compute schedulable successors, which is called in statement 6 of the pseudocode shown in Figure 8 is appropriately augmented. Example 7: Consider the CDFG fragment shown in Figure 10. The sets of scheduled and unscheduled operations are the same as that of Example 6 (op1, op3, and op4 are unscheduled, op5 and op6 are scheduled, and op2 has
just been scheduled). Observation 1 can be applied to the set S = fop2, op5g, with operation op7 taking the place of fanout. As specified in Observation 1, op2 and op5 feed inputs of op7 by paths passing exclusively through select operations (Sel1 (Sel2) in the case of op2 (op5)). Therefore, op7/(c(op1) ∧ c(op4)) is a valid successor to the operations in S. Likewise, op7/(c(op1) ∧ c(op4)) is a valid successor to the set fop2, op6g. So far, we have described the technique used to identify all schedulable successors of an operation. This was
accomplished by tagging operations with the conditions under which their results would be valid. Note that our procedure allows us to speculate on all possible outcomes of a branch, and arbitrarily deeply into nested branches. If integrated with a scheduler which supports loop unrolling, the speculation could also cross loop boundaries. We now present the technique used to validate or invalidate speculatively performed operations whose dependencies have just been resolved. Step 2: This step addresses the issue of validating and invalidating operations when the conditions upon which they are speculated are resolved. Suppose operation ops , which resolves a condition c, has just been scheduled. The resolution of c results in the creation of two different threads of execution, where (i) c = true, and (ii) c = false. The following procedure is carried out for every operation, op, which belongs either to the set, Schedulable operations, V of schedulable operations, or the set of scheduled operations. Let op be conditioned on C = ij=1 c j . In the true (false) branch, C is evaluated assuming a value of 1 (0) for c, and the resultant expression is the new expression that op is conditioned on. Example 8: Consider the CDFG fragment shown in Figure 10 again. We assume that operations op7/(c(op1) ∧ c(op4)) and op7/(c(op1) ∧ c(op4)) have been scheduled, and operation op4, which decides c(op4), has just been scheduled. As a result of this, the STG forks off into two branches, with branch A, representing a true value, and branch B representing a false value for c(op4). In branch A, operation op7/(c(op1) ∧ c(op4)) is now conditioned on c(op1) ∧ c(op4) evaluated assuming c(op4) = 1, which computes to 0. Similarly, op7/(c(op1) ∧ c(op4)) is now conditioned on c(op1) ∧ c(op4) evaluated assuming c(op4) = 1, which is c(op1). The result of an operation conditioned on 0 will never be used. Therefore, every operation conditioned on 0 can be removed from the set of scheduled or schedulable operations to prevent it from sourcing successors. Step 3: We now describe the procedure employed by our scheduler to select an operation to schedule, from a pool of schedulable operations, Schedulable operations. Schedulable operations can contain operations which are conditioned on different sets of events, i.e., we can choose different paths to speculate upon. We need to decide the 13
“best” candidate for scheduling on a given resource, where, by best, we mean the operation whose scheduling on the given resource would minimize the expected number of cycles for the schedule. Formally, the problem can be stated as follows: given (i) a partial schedule, (ii) a functional unit, fu, (iii) a set of operations, S (some of which may be speculative), which can execute on the functional unit, and (iv) typical input traces, select the operation, which, if scheduled on fu, would minimize the expected number of cycles. The above problem has been proven NP-complete, even for conditional- and loop-free behavioral descriptions [26]. We, therefore, use the following heuristic, whose guiding principle has been successfully employed by several scheduling algorithms [27]. The heuristic is based on the following premise: operations in the CDFG which feed primary outputs through long paths are more critical than operations which feed primary outputs through short paths and, therefore, need to be scheduled earlier. The rationale behind this heuristic is that operations which belong to short paths are more mobile than those on long paths, i.e., the total schedule length is less sensitive to variations in their schedules. The length of a path is measured as the sum of the delays of its constituent operations. In data-dominated descriptions, with no loops and conditional operations, the longest path between any pair of operations is fixed. In control-flow intensive descriptions, some paths could be input-dependent. Therefore, the longest path between a pair of operations must be defined with respect to a given input. For example, for the CDFG shown in Figure 4, the longest path connecting primary input c with output out depends upon the value taken by operation > 1. Since our scheduling algorithm is geared towards minimizing the average execution time, we use the expected length of the longest path from an operation to a primary output as a metric to rank different operations. We use the notation λ(op) to denote this quantity. Speculation adds a new dimension to this problem: the result computed by an operation is not guaranteed to be useful. For an operation, op, we account for this effect by multiplying the probability that an operation’s output is utilized, with λ(op) to derive a metric of an operation’s criticality. This is expressed by means of the following equation: i
criticality(op) = λ(op) ∏ P(c j )
(5)
j=1
where criticality(op) measures the desirability of scheduling op, ∏ij=1 P(c j ) is the product of the probabilities of the events that op is conditioned on, and λ(op) is as defined in the previous paragraph. Example 9: We now illustrate the application of this heuristic to the scheduling of the CDFG shown in Figure 4, under the following resource constraints: one adder, one shifter, one multiplier, one incrementer, and one comparator. All functional units are assumed to execute in one cycle. Suppose the CDFG has been partially scheduled, and the current STG is shown in Figure 11. S0 is the state under consideration, and operation + + 1 has been scheduled
in it. We need to pick one operation from the set f+1/c1, +2/c1g to execute on the adder. Operation > 1 is assumed to evaluate to true (false) with a probability of 0.6 (0.4). We now evaluate criticality(+1/c1) and criticality(+2/c1g, ++1
S0
Figure 11: A partially constructed STG to illustrate operation selection and pick the operation with a higher criticality to schedule in S0. Using Equation (5), criticality(+1/c1) = λ(+1/c1) P(c1) = 2 0.6 = 1.2 14
Queue.enque(S0)
1
2
S = Queue.deque() Sch_op = Sched_succ[S] Queue empty
Create state(New_state)
3
For each combination of conditions
S = Initially schedulable operations, S0 = Initial state Sched_succ[S0] = S
Op = Select(S_part)
Sched_succ[New_state] = S_part
5
Edge, labeled C from S to New_state 10 Queue.enque(New_state)
6
Schedule Op in New_state, Add newly schedulable operations to S_part, Remove Op from S_part
Edge, labeled C from S to S_old 11 7
Queue not empty
Partition Sch_Op into sets, S_Part, of operations executed under the same combination of conditions, C 4
9
Yes Can more operations be scheduled in New_state?
Yes 8
Is New_state equivalent to ex− isting state, S_old ? 12
No
No
STOP
Figure 12: Flow diagram of our scheduling algorithm Likewise, criticality(+2/c1) can be computed to be 1 0.4 = 0.4, leading to the selection of +1/c1. The expected number of cycles for the schedule derived by the application of this heuristic is 3. If +2/c1 were chosen instead, the expected number of cycles can be proven to have a lower bound of 3.2. 4.4 Our scheduling algorithm In this section, we describe our scheduling algorithm, extended to support speculative execution. We first explain the working of our algorithm, and present an illustrative example. Figure 12 details the working of our algorithm. The boxes shown shaded are modified to support speculative execution. The algorithm maintains, for each state, the set of operations schedulable immediately upon leaving the state. The array Sched succ, indexed by state, contains this information. Once a new state has been formed, it is enqueued (steps 10 and 2). States are successively dequeued (step 3), and their immediate successor states are inferred. If the successor states are new states, they are in turn enqueued (step 10), and so on. The process terminates with the construction of the STG. We now briefly describe the process of inferring the successors of a dequeued state. Once a state, S, is dequeued, the set, Sch op, of operations schedulable upon leaving it is obtained from the array Sched succ (step 3). Sch op can, in general, contain operations executing under different combinations of conditions. For example, if there are three operations, op1, op2, and op3, which are control-dependent on c1, c2, and c2, respectively, there are four possible combinations of conditions, (i) c1 ∧ c2, (ii) c1 ∧ c2, (iii) c1 ∧ c2, and (iv) c1 ∧ c2. A1 = fop3g, A2 = fop2g, A3 = fop1, op3g, A4 = fop1, op2g are the sets of operations which execute under conditions (i), (ii), (iii), and (iv), respectively. The edge from step 4 to step 5 is traversed for each of these
conditions. A1, A2, A3, and A4 represent S part for successive traversals of the edge, and a new state is created for scheduling each of these sets. Step 5 creates a state which is responsible for scheduling the operations in S part. New state grows as operations are scheduled in it. When an operation, op, is scheduled, its schedulable successors are inferred and added to S part, and op is removed from S part. When no more operations can be scheduled in New state, the algorithm checks 15
whether it is equivalent to any pre-existing state. If not, it is enqueued, and Sched succ[New state] is assigned S part. x y
y x
x
y
>1 c1
−2
c1
0
−1 x
y !1
== 1 c2 ||1 c3
c3
Stop
Figure 13: CDFG for GCD
S1
S0
S0
S0
S1
>1_0, −1_0/c1_0
S1 >1_0, −1_0/c1_0
>1_0, −1_0/c1_0
!1_0, ==1_0, >1_1/c2_0, −1_1/(c2_0
c1_1), ||1_0
!1_0, ==1_0, >1_1/c2_0, −1_1/(c2_0
c1_1), ||1_0
c1_1
c2_0
c1_1
−2_0 S3
S2
S2 c2_0
c1_0
c1_0
c1_0
c2_0
c2_0
c1_1
S4 STOP
(a)
(b)
(c)
Figure 14: The STG for Example 10 shown at intermediate scheduling stages Example 10: The working of our scheduling algorithm is illustrated using the CDFG representation for the greatest common divisor (GCD) example [28], shown in Figure 13. The resource constraints are one subtracter, Sub1, one equality comparator, Eq1, one comparator, Comp1, one NOT gate, Not1, and one OR gate, Or1. Sub1, Eq1, Comp1, Not1, and Or1 execute in one cycle. The chains, Not1 and Or1, and Eq1 and Or1, of functional units are also assumed to take one cycle. c1 is assumed to evaluate to true (false) with a probability of 0.8 (0.2), and c2 is assumed to evaluate to true (false) with a probability of 0.05 (0.95). The algorithm starts with the creation of an initial state, S0, and the identification of operations which can be scheduled initially. For the CDFG shown in Figure 13, it can be seen that the first occurrence of operation ≥ 1 does not have any data or control dependencies, and can, therefore, be scheduled. Hence, it is added to the set, S, of initially schedulable operations. Our scheduler is capable of supporting implicit unrolling of loops. Therefore, operations from several iterations can potentially be scheduled in the same state. We, therefore, annotate operations embedded in loops with a loop iteration number. For example, ≥ 1 0 represents the zeroth iteration of the loop. We note that the data dependencies of operations ?1 0 and ?2 0 have been resolved, but their control dependency (c1), 16
has not. These operations, appropriately conditioned, are, therefore, candidates for speculative scheduling. Thus,
the set of initially schedulable operations, Sched succ[S0] = f≥ 1 0, ?1 0/c1 0, ?2 0/c1 0g. The initial state, S0, is enqueued (step 2) and subsequently dequeued (step 3). At this stage, no control dependencies have been resolved. Therefore, all operations in Sch op are put into a single partition, S part, and a new state, S1, is created to schedule these operations (step 5). The algorithm now selects operations to execute on the available resources (step 6). Operation ≥ 1 0 executes on functional unit Comp1, and operation ?1 0/c1 0 is chosen to execute on functional unit Sub1. Figure 14(a) shows
a snapshot of the STG at this stage of the algorithm. The schedulable successors of operation ≥ 1 0 are !1 0, and ?1 0, if c1 0 = 1, and ?2 0 if c1 0 = 0. Operations ?1 0 and ?2 0 have already been included in the set, S part
of schedulable operations as ?1 0/c1 0 and ?2 0/c1 0. Therefore, !1 0 is the only addition to S part. Operations 6= 1 0/c1 0, ≥ 1 1/(c1 0 ∨c2 0), ?1 1/((c1 0 ∨c2 0) ∧c1 1), and ?2 1/((c1 0 ∨c2 0) ∧c1 1) (schedulable successors of operation ?1 0/c1 0) are also added to S part (step 7). The interplay of loop unrolling with speculative execution, displayed here, is worth noting. Though the pro-
cedures for loop unrolling and speculation are essentially independent of each other, they complement each other perfectly. Unrolling essentially expands the CDFG by differentiating between different copies of operations, and speculative execution automatically explores opportunities to schedule the unrolled operations, before their control and data dependencies have been resolved. At this stage, operation !1 0 cannot fit into the state because of clock cycle constraints, and because the subtracter Sub1 has already been utilized. Therefore, none of the schedulable operations can be fit into the current state, terminating its growth. Hence, step 8 returns a no, and since S1 is not equivalent to any other state in the STG, so does step 12. S1 is then enqueued (step 10), and the set of currently schedulable operations, S part, is stored in the array location Sched succ[S1] (step 9). State S1 is then dequeued, and Sch op is assigned the set f!1 0, ?2 0/c1 0, 6= 1 0/c1 0, ≥ 1 1/((c1 0 ∨ c2 0),
?1 1/((c1 0 ∨ c2 0) ∧ c1 1), ?2 1/((c1 0 ∨ c2 0) ∧ c1 1)g (step 3).
Since c1 0 has been resolved in S1, Sch op is divided into two sets (step 4), which represent the operations in Sch op which execute when c1 0 takes the value
true (S parttrue ) and false (S partfalse). We now describe the evaluation of S parttrue; the evaluation of S partfalse proceeds along similar lines. S parttrue is the set of operations which can execute when c1 0 evaluates to true. We use the procedure outlined in step 2 of Section 4.3 to modify operations conditioned on c1 0 and c1 0. Operation ?2 0/c1 0, which is conditioned on c1 0 evaluating to false, is not included in S parttrue , because the expression
it is conditioned on evaluates to 0 if we substitute c1 0 = 1. Likewise, operation 6= 1 0/c1 0, which was initially conditioned on c1 0 is now conditioned on 1 (since c1 0 = 1). It, therefore, appears as 6= 1 0 in S parttrue . Other operations included in S parttrue are !1 0, ≥ 1 1/c2 0, ?1 1/(c2 0 ∧ c1 1), and ?2 1/(c2 0 ∧ c1 1). To summarize, S parttrue = f!1 0, 6= 1 0, ≥ 1 1/c2 0, ?1 1/(c2 0 ∧ c1 1), ?2 1/(c2 0 ∧ c1 1)g
(6)
A new state, S2, is created to schedule the operations in S parttrue (step 5). Operations ?1 1/(c2 0 ∧ c1 1), ≥ 1 1/c2 0, !1 0, 6= 1 0, and jj1 0 are chosen to execute on functional units Sub1, Comp1, Not1, Eq1, and Or1, respectively (step 6). State growth ends at this time because the allocated resources are exhausted. At this point, S part evaluates to f!1 1/c2 0, ?2 1/(c2 0 ∧ c1 1), 6= 1 1/(c2 0 ∧ c1 1), ≥ 1 2/((c1 1 ∨ c2 1) ∧ c2 0), ?1 2/((c1 1 ∨
c2 1) ∧ c2 0 ∧ c1 2), ?2 2/((c1 1 ∨ c2 1) ∧ c2 0 ∧ c1 2)g. The resolution of c1 1 and c2 0 creates four combinations of conditions. We examine the case when c1 1 = 1 and c2 0 = 1. Using the procedure outlined in step 2 of Section 4.3 yields an S part of f!1 1, 6= 1 1, ≥ 1 2/c2 1, ?1 2/(c2 1 ∧ c1 2), ?2 2/(c2 1 ∧ c1 2)g. We observe that the 17
Table 1: Expected number of cycles, number of states, best- and worst-case number of cycles results Circuit Barcode GCD Test1 TLC Findmin
E.N.C. WS WS-spec 449.8 227.2 95.2 48.5 579.7 80.3 507.0 507.0 522.4 265.0
#states WS WS-spec 8 3 6 6 10 10 4 4 4 3
best-case WS WS-spec 4 1 3 3 1 1 507 507 2 2
worst-case WS WS-spec 502 251 515 260 4050 265 507 507 472 237
elements of S part can be mapped one-to-one to the elements of S parttrue (shown in Equation (6)), such that the difference in iteration numbers of a pair of operations in S part equals the difference in iteration numbers of the corresponding pair in S parttrue . M represents this mapping. This indicates that, if variables are relabeled as indicated by the map, i.e., if variable v i is relabelled as v (i ? 1), state S2 can be considered a valid successor of itself under
the condition c1 ∧c2. The variable relabelings are implemented as register-to-register transfers. Figure 14 (b) shows the STG at this point. The final STG is shown in Figure 14 (c), which can be shown to be obtained at the conclusion of the algorithm.
5 Experimental results The techniques described in this paper were implemented in a program called Wavesched-spec, written in C++. We evaluated this program by using it to produce schedules for several commonly available benchmarks. These schedules were compared against those produced by the scheduling algorithm, Wavesched, without the use of speculative execution, with respect to the following metrics: (a) expected number of cycles, (b) number of states in the STG produced, (c) the smallest number of cycles taken to execute the behavioral description, and (d) the largest number of cycles taken to execute the behavioral description. In general, finding the largest number of cycles taken to execute a behavioral description is a hard problem. However, for the examples considered in this paper, static analysis of the description was sufficient to find the number. Table 1 summarizes the results obtained. The columns labeled E.N.C., #states, best-case, and worst-case represent, respectively, the expected number of cycles, the number of states in the STG produced, smallest number of cycles taken to execute the STG, and the largest number of cycles taken to execute the STG. Minor columns WS and WS-spec represent schedules produced by Wavesched and Wavesched-spec, respectively. We used a library of functional units which consisted of (a) an adder, add1, (b) a subtracter, sub1, (c) a multiplier, mult1, (d) a less-than comparator, comp1, (e) an equality comparator, eqc1, and (f) an incrementer, inc1. Unlimited numbers of singleinput logic gates (OR, AND, and NOT) were assumed to be available. All functional units except mult1, which executes in two cycles, take one cycle to execute. The allocation constraints for an example can be found by looking up the entry corresponding to the example in Table 2. For example, the allocation constraints for GCD are two sub1, one comp1, and two eqc1. The expected number of cycles for a design was measured by simulating a VHDL description of the schedule using the Synopsys VSS simulator. The input traces used for simulation were obtained as zero-mean Gaussian sequences. Of our examples, GCD, Barcode, TLC, and Findmin are borrowed from the literature. Test1 is the example shown in Figure 1. GCD computes the greatest common divisor of its inputs, Barcode represents a barcode reader, TLC
18
Table 2: Allocation constraints for the examples in Table 2 Circuit Barcode GCD Test1 TLC Findmin
add1 2 2 -
sub1 2 -
mult1 4 -
comp1 3 1 1 1 2
eqc1 3 2 1 2
inc1 3 1 1 1
represents a traffic light controller, and Findmin returns the index of the minimum element in an array. The results obtained indicate that Wavesched-spec produced an average expected schedule length speedup of 2.8 over schedules obtained using Wavesched. Note that Wavesched [12] was reported to have achieved an average speedup of 2 over schedules produced by existing scheduling algorithms, such as path-based scheduling [16], and loop-directed scheduling [19]. To get an idea of the area overhead of this technique, we obtained an RTL implementation for the GCD example using an in-house high-level synthesis system, for the schedules produced by Wavesched-spec and Wavesched. These RTL circuits were technology-mapped using the MSU library, and the area of the gate-level circuits were obtained. The area overhead for the circuit produced from Wavesched-spec was found to be 3.1%. We also note that for Wavesched-spec the number of cycles in the shortest and longest paths is smaller than or equal to the corresponding number for Wavesched.
6 Conclusions In this paper, we presented a technique to incorporate speculative execution into scheduling of control-flow intensive designs. We demonstrated that in order to fully exploit the power of speculative execution, one needs to integrate it with scheduling. We introduced a node-tagging scheme for the identification of operations which can be speculatively scheduled in a given state, and a heuristic to select the “best” operation to schedule. Our techniques were fully integrated into an existing scheduling algorithm which can support implicit unrolling of loops, functional pipelining of control-flow intensive behaviors, and can parallelize the execution of independent loops whose bodies share resources. Experimental results demonstrate that the presented techniques can improve the performance of the generated schedule significantly. Schedules produced using speculative execution were, on an average, 2.8 times faster than schedules produced without its benefit.
References [1] E. M. Riseman and C. C. Foster, “The inhibition of potential parallelism by conditional jumps,” IEEE Trans. Computers, vol. 21, pp. 1405–1411, Dec. 1972. [2] M. S. Lam and R. P. Wilson, “Limits of control flow in parallelism,” in Proc. Int. Symp. Computer Architecture, pp. 46–57, May 1992. [3] U. Holtmann and R. Ernst, “Experiments with low-level speculative computation based on multiple branch prediction,” IEEE Trans. VLSI Systems, vol. 1, pp. 262–267, Sept. 1993. [4] K. Wakabayashi and H. Tanaka, “Global scheduling independent of control dependencies based on condition vectors,” in Proc. Design Automation Conf., pp. 112–115, June 1991. [5] U. Holtmann and R. Ernst, “Combining MBP-speculative computation and loop pipelining in high-level synthesis,” in Proc. European Design & Test Conf., pp. 550–556, Mar. 1995. [6] J. A. Fisher, “Trace scheduling: A technique for global microcode compaction,” IEEE Trans. Computers, vol. 30, pp. 478–490, July 1981. [7] F. W. Burton, “Speculative computation, parallelism, and functional programming,” IEEE Trans. Computers, vol. C-34, pp. 1190–1193, Dec. 1985.
19
[8] K. Ebcioglu, “A compilation technique for software pipelining of loops with conditional jumps,” in Proc. 20th Annual Wkshp. Microprogramming and Microarchitecture, pp. 69–79, 1987. [9] M. Srinivas, A. Nicolau, and V. H. Alan, “An approach to combine predicated/speculative execution for programs with unpredictable branches,” in Proc. Int. Conf. Parallel Algorithm and Compilation Techniques, 1994. [10] K. I. Farkas, N. P. Jouppi, and P. Chow, “How useful are non-blocking loads, stream buffers, and speculative execution in multiple issue processors,” in Proc. Symp. High-Performance Computer Arch., pp. 78–89, Jan. 1995. [11] J. Tatemura, “Speculative parallelism of intelligent interactive systems,” in Proc. Int. Conf. Industrial Electronics, Control, and Instrumentation, pp. 193–198, Nov. 1995. [12] G. Lakshminarayana, K. S. Khouri, and N. K. Jha, “Wavesched: A novel scheduling technique for control-flow intensive behavioral descriptions,” in Proc. Int. Conf. Computer-Aided Design, Nov. 1997. [13] D. D. Gajski, N. Dutt, A. Wu, and S. Lin, High-Level Synthesis: Introduction to Chip and System Design. Kluwer Academic Publishers, Boston, 1992. [14] P. G. Paulin and J. P. Knight, “Force-directed scheduling for the behavioral synthesis of ASIC’s,” IEEE Trans. Computer-Aided Design, vol. 8, pp. 661–679, June 1989. [15] I. C. Park and C. M. Kyung, “FAMOS: An efficient scheduling algorithm for high-level synthesis,” IEEE Trans. Computer-Aided Design, vol. 12, pp. 1437–1448, Oct. 1993. [16] R. Camposano, “Path-based scheduling for synthesis,” IEEE Trans. Computer-Aided Design, vol. 10, pp. 85–93, Jan. 1991. [17] R. Bergamaschi, S. Raje, I. Nair, and L. Trevillyan, “Control-flow versus data-flow scheduling: Combining both approaches in an adaptive scheduling system,” IEEE Trans. VLSI Systems, vol. 5, pp. 82–100, Mar. 1997. [18] I. Radivojevic and F. Brewer, “A new symbolic technique for control-dependent scheduling,” IEEE Trans. Computer-Aided Design, vol. 15, pp. 45–57, Jan. 1996. [19] S. Bhattacharya, S. Dey, and F. Brglez, “Performance analysis and optimization of schedules for conditional and loop-intensive specifications,” in Proc. Design Automation Conf., pp. 491–496, June 1994. [20] D. Herrmann and R. Ernst, “Register synthesis for speculative computation,” in Proc. European Design & Test Conf., pp. 463–467, Mar. 1997. [21] R. P. Colwell and R. L. Steck, “A 0.6um BiCMOS processor with dynamic execution,” in Proc. Int. Solid-State Circuits Conf., pp. 176– 179, Feb. 1995. [22] K. C. Yeager, “The MIPS R10000 superscalar microprocessor,” IEEE Micro, vol. 16, pp. 28–40, Apr. 1996. [23] P. P. Chang, N. J. Warter, S. A. Mahlke, W. Y. Chen, and W. W. Hwu, “Three architectural models for compiler-controlled speculative execution,” IEEE Trans. Computers, vol. 44, pp. 481–494, Apr. 1995. [24] M. Renfors and Y. Neuvo, “The maximum sampling rate of digital filters under hardware constraints,” IEEE Trans. Circuits & Systems, pp. 196–202, Mar. 1981. [25] D. Messerschmitt, “Breaking the recursive bottleneck,” Performance Limits in Communication: Theory and Practice, Kluwer Academic Publishers, Boston, 1988. [26] M. Garey and D. Johnson, Computers and Intractibility. W.H. Freeman & Company, New York, 1979. [27] R. Jain, A. Majumdar, A. Sharma, and H. Wang, “Empirical evaluation of some high-level synthesis scheduling heuristics,” in Proc. Design Automation Conf., pp. 686–689, June 1991. [28] F. F. Hsu, E. M. Rudnick, and J. H. Patel, “Enhancing high level control-flow for improved testability,” in Proc. Int. Conf. ComputerAided Design, pp. 322–329, Nov. 1996.
20