Scheduling and Binding Algorithms for High-Level Synthesis
Pierre G. Paulin, BNR, P.O. Box 3511, Stn. "C", Ottawa, Canada K1Y 4H7
John P. Knight, Carleton Univ., Dept. of Electronics, Ottawa, Canada K1S 5B6
Abstract
New algorithms for high-level synthesis are presented. The first performs scheduling under hardware resource constraints and improves on commonly used list scheduling techniques by making use of a global priority function. A new design-space exploration technique, which combines this algorithm with an existing one based on time constraints, is also presented. A second algorithm is used for register and bus allocation to satisfy two criteria: the minimization of interconnect costs as well as the final register (bus) cost. A clique partitioning approach is used where the clique graph is pruned using interconnect affinities between register (bus) pairs. Examples from current literature were chosen to illustrate the algorithms and to compare them with four existing systems.
1. Introduction
As logic and RTL-level synthesis tools gain a stable foothold in industry, the automatic synthesis of a digital system from a behavioral description - high-level synthesis - is the next step on the ladder of the design automation hierarchy. A tutorial on the subject is given in [McPC88]. The interest in high-level synthesis is a natural consequence of the shift of IC designers' involvement away from device-level considerations and towards architectural ones. In this paper, we will present algorithms that solve two difficult tasks in high-level synthesis, namely scheduling under resource constraints and register/bus allocation to minimize interconnect costs. These algorithms, originally implemented in the HAL system [Paul86], [Paul87], can also be integrated into specialized or general purpose high-level synthesis systems.
2. Scheduling under Resource Constraints
In the context of high-level synthesis, scheduling [McPC88] consists of determining a propagation delay for every operation of the input behavioral description and then assigning each one to a specific control-step (a c-step is often equivalent to a single state of an FSM). One commonly used approach is a list scheduling (LS) technique where a hardware constraint is specified and the algorithm attempts to minimize the total execution time by using a local priority function to defer operations when resource conflicts occur.
A different approach, force-directed scheduling (FDS), was presented in [Paul87] and reimplemented by other groups in academia [Clou87], [Stok88] and industry [Fuhr88]. In this approach, a global time constraint is specified and the algorithm attempts to minimize the resources required to meet that constraint. This formulation of constraints is useful for DSP applications where system throughput is fixed and area must be minimized. The main strength of the algorithm was the use of a global measure of concurrency to guide the scheduling. As its name implies, the force-directed list scheduling (FDLS) algorithm presented here combines the characteristics and strengths of both force-directed and list scheduling. Like the former, it uses a global measure of concurrency throughout the scheduling process. Furthermore, it can be used when a predefined number of functional units of any type is specified. In the next subsection, we will give a quick review of the main concepts of the FDS algorithm (refer to [Paul87] or [Paul89] for details) while focusing on the elements it has in common with the new FDLS algorithm presented further on.
2.1 A Review of Force-Directed Scheduling
The intent of the force-directed scheduling algorithm is to reduce the number of functional units, registers and buses required by balancing the concurrency of the operations assigned to them, but without lengthening the total execution time. This is achieved using the three-step algorithm summarized below.
Determination of time frames: The first step consists of determining the time frames of each operation by evaluating the ASAP (as soon as possible) and ALAP (as late as possible) schedules. This determines the possible time frames for each operation, as depicted in Fig. 2.1 for the differential equation (DiffEq) example of [Paul87]. The width of the box containing a particular operation represents the probability that the operation will eventually be placed in a given time slot. Uniform probabilities are assumed.
Creation of distribution graphs: The next step is to take the sum of the probabilities of each type of operation for each c-step of the CDFG. The resulting distribution graphs (DGs) indicate the concurrency of similar operations. For each DG, the distribution in c-step i is given by:

    DG(i) = Σ(Opn of a given type) Prob(Opn, i)        (1)

where the sum is taken over all operations of a given type. Using Fig. 2.1, we can calculate the values of the multiplication DG. This yields: DG(1) = 2.833, DG(2) = 2.333, DG(3) = 0.833 and DG(4) = 0.
Force calculation: The final step is to calculate the force associated with every feasible c-step assignment of each operation.
This is done by temporarily reducing the operation's time frame to the selected c-step. The force associated with the reduction of an initial time frame (bounded by c-steps t and b) to a new time frame (bounded by c-steps nt and nb) is given by the following equation (footnote 1):

    Force(nt, nb) = Σ(i = nt to nb) [ DG(i) / (nb - nt + 1) ]  -  Σ(i = t to b) [ DG(i) / (b - t + 1) ]        (2)
Each sum represents the average of the distribution values for the c-steps bounded by the time frame. The force is therefore equal to the difference between the average distribution for the c-steps bounded by the new time frame and the average for the c-steps of the initial one. This force calculation must also be performed for all predecessors and successors of the current operation, but only when their time frames are affected. These are defined as indirect forces [Paul87, Paul89]. The resulting force is the sum of the direct and indirect forces. After the forces of all operations have been calculated, we select the operation to c-step assignment with the lowest force - that is, the best concurrency balancing. Time frames are readjusted and the entire process is repeated until all operations are scheduled.
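To make the force computation concrete, here is a minimal sketch of equations (1) and (2) for a single operation type. The operation names and time frames are hypothetical (not the DiffEq graph of Fig. 2.1), uniform probabilities are assumed, and only the direct force is computed; the indirect forces on predecessors and successors described above would be added on top of this.

```python
# Illustrative sketch of equations (1) and (2); operation names and
# time frames are hypothetical. Indirect forces are omitted.

def distribution_graph(time_frames, num_csteps):
    """DG(i) = sum of Prob(Opn, i) over all ops of one type (uniform probs)."""
    dg = [0.0] * (num_csteps + 1)           # 1-indexed c-steps
    for (top, bottom) in time_frames.values():
        prob = 1.0 / (bottom - top + 1)     # uniform probability per c-step
        for i in range(top, bottom + 1):
            dg[i] += prob
    return dg

def force(dg, old_frame, new_frame):
    """Force of reducing frame (t, b) to (nt, nb): average DG over the new
    frame minus average DG over the initial frame (equation 2)."""
    (t, b), (nt, nb) = old_frame, new_frame
    new_avg = sum(dg[i] for i in range(nt, nb + 1)) / (nb - nt + 1)
    old_avg = sum(dg[i] for i in range(t, b + 1)) / (b - t + 1)
    return new_avg - old_avg

if __name__ == "__main__":
    # Four multiplications with hypothetical ASAP/ALAP time frames.
    mult_frames = {"m1": (1, 1), "m2": (1, 1), "m3": (1, 2), "m4": (1, 3)}
    dg = distribution_graph(mult_frames, num_csteps=4)
    # Force of tentatively assigning m4 to c-step 2 (reducing its frame to (2, 2)).
    print(force(dg, mult_frames["m4"], (2, 2)))
```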
2.2 Force-Directed List Scheduling
The scheduling approach just described supports the synthesis of near-minimum cost datapaths under fixed timing constraints. The FDLS algorithm presented here solves the dual problem: the determination of a schedule with a near-minimal number of c-steps, given fixed hardware constraints. It is based on the well-known list scheduling (LS) algorithm [Davi81] as well as the FDS algorithm just presented. Recall that in list scheduling, operations are sorted in topological order by using control and data dependencies. The set of operations that may be placed in a c-step may then be evaluated; we call these the ready operations. If the number of ready operations of a single type exceeds the number of hardware modules available to perform them, then one or more operations must be deferred. In previous list scheduling algorithms, the selection of the deferred operations is determined by a local priority function such as mobility [PaGa86] or urgency [GiKn84]. In force-directed list scheduling (FDLS), the approach is similar except that force is used as the priority function. More precisely, whenever a hardware constraint is exceeded in the course of regular scheduling, force calculations are used to select the best operation(s) to defer. Here, the deferral of an operation implies that its time frame is reduced so that it excludes the current c-step. The deferral that produces the lowest force - i.e. the lowest global increase of concurrency in the graph - is chosen. This is repeated until the hardware constraint is met. Typically, the hardware constraint is given on the number of functional units. However, this principle can also be applied to data transfer operations when fixed limits on buses are given, and, under certain conditions, to storage operations when a maximum number of registers is specified. Forces are calculated using the method described in the previous section. However, as these calculations depend on the existence of time frames, a global time constraint must be temporarily specified. Here it is simply set to the length of the current critical path. This length is increased when the only way of resolving a resource conflict is to defer a critical operation.
1 This formulation of force is equivalent to the one given in [Paul87]; however, it is more computationally efficient because: 1. it involves mostly additions and subtractions, and 2. the initial average distribution need only be calculated once when evaluating the force associated with all c-steps of the time frame.
The FDLS algorithm is illustrated in Figures 2.1 and 2.2 for the DiffEq example presented earlier. It is assumed here that the user has specified that the datapath may not use more than two multipliers, one adder, one subtractor and one comparator. The minimum time is equal to four c-steps, the length of the critical path. Time frames are determined in the fashion described in the previous subsection. Operation chaining is achieved by extending the time frame of operations into the previous or next c-step when the combined propagation delays (added to the latch delays) of the chained operations are less than the clock cycle. Multi-cycle operations are supported with a straightforward extension [Paul89] of the single-cycle methodology presented here. In the first iteration of the algorithm (Fig. 2.1), there are four multiplications and one addition that may be scheduled in c-step one. The addition is scheduled immediately. As for the multiplications, there are two on the critical path. By definition, the force associated with an attempted deferral of a critical operation is equal to infinity. The deferral of the third or fourth multiplications will necessarily imply a lower force; these will therefore have c-step 1 removed from their time frames, as shown in Fig. 2.2 (a). This process is repeated for the next two c-steps, as shown in Fig. 2.2 (a) and (b), yielding the final schedule.
Fig. 2.1  FDLS algorithm: first iteration (time frames for the DiffEq example).
Fig. 2.2  (a), (b) FDLS algorithm: second and third iterations.
The FDLS algorithm can be summarized as follows:
1. Initialize the time constraint to the length of the critical path.
2. for c-step from 1 to time constraint do:
   2.1 - Determine time frames.
   2.2 - Determine ready operations in the c-step (opns. whose time frame intersects the current c-step).
   2.3 - while (no. of ready opns. > no. of FUs) do:
         - if all opns. are on the critical path then
             - extend the time constraint by 1 c-step
             - re-evaluate time frames
         - Calculate forces for possible deferrals
         - Defer the operation with the lowest force
         - Remove it from the ready opns.
     end;
   2.4 - Schedule remaining ready opns. in the current c-step.
end;
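The following is a compact sketch of this loop, under simplifying assumptions that are ours rather than the paper's: single-cycle operations, a single functional unit type, uniform probabilities, and only the direct force of equation (2) (indirect forces and the resulting time-frame propagation are omitted). The function names and the dependence graph in the example are illustrative.

```python
# Compact sketch of force-directed list scheduling (FDLS). Assumptions:
# single-cycle ops, one FU type, uniform probabilities, direct force only.

def asap_alap(deps, num_ops, T):
    """ASAP/ALAP c-steps (1-indexed) under time limit T; deps[op] lists the
    predecessors of op, and ops are assumed numbered in topological order."""
    asap = [1] * num_ops
    for op in range(num_ops):
        for p in deps[op]:
            asap[op] = max(asap[op], asap[p] + 1)
    alap = [T] * num_ops
    for op in reversed(range(num_ops)):
        for p in deps[op]:
            alap[p] = min(alap[p], alap[op] - 1)
    return asap, alap

def fdls(deps, num_ops, num_fus):
    T = max(asap_alap(deps, num_ops, num_ops)[0])   # critical-path length
    schedule, cstep = {}, 1
    while cstep <= T:
        asap, alap = asap_alap(deps, num_ops, T)
        for op, c in schedule.items():              # freeze scheduled ops
            asap[op] = alap[op] = c
        ready = [op for op in range(num_ops)
                 if op not in schedule and asap[op] <= cstep <= alap[op]]
        while len(ready) > num_fus:
            # Distribution graph from the current time frames.
            dg = [0.0] * (T + 2)
            for op in range(num_ops):
                width = alap[op] - asap[op] + 1
                for i in range(asap[op], alap[op] + 1):
                    dg[i] += 1.0 / width
            best_op, best_force = None, None
            for op in ready:
                if alap[op] <= cstep:
                    continue                        # critical: cannot be deferred
                nt, nb = cstep + 1, alap[op]        # deferral excludes this c-step
                new_avg = sum(dg[nt:nb + 1]) / (nb - nt + 1)
                old_avg = sum(dg[asap[op]:alap[op] + 1]) / (alap[op] - asap[op] + 1)
                f = new_avg - old_avg
                if best_force is None or f < best_force:
                    best_op, best_force = op, f
            if best_op is None:                     # only critical ops remain:
                T += 1                              # extend the time constraint
                asap, alap = asap_alap(deps, num_ops, T)
                for op, c in schedule.items():
                    asap[op] = alap[op] = c
                continue
            asap[best_op] = cstep + 1               # defer op with lowest force
            ready.remove(best_op)
        for op in ready:
            schedule[op] = cstep
        cstep += 1
    return schedule, T

if __name__ == "__main__":
    # Hypothetical dependence graph: op -> list of predecessor ops.
    deps = [[], [], [], [0, 1], [2, 3]]
    print(fdls(deps, num_ops=5, num_fus=2))
```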
Using this approach, the advantages of both types of scheduling are maintained:
1. High utilization of functional units, as this feature is intrinsic to list scheduling.
2. Low computational complexity; the FDLS algorithm has a worst-case complexity of O(n²) - where n is the number of operations in the CDFG - and typically exhibits linear behavior.
3. Global evaluation of all the side effects of an attempted operation to c-step assignment. This characteristic is the basis of force-directed scheduling.
2.3 Scheduling Examples
We will use the fifth-order elliptic wave filter [KuWK85] that was chosen as a benchmark for the "1988 Workshop on High-Level Synthesis" [DeBo88]. In the first row of Table 2.1 below, we summarize the adder and multiplier allocations for different timing constraints as obtained from the regular FDS algorithm. In this table, we assume that multipliers require two c-steps for execution and the adders only one. The minimum timing constraint for this example is 17 c-steps. Using retiming, it could be reduced to 16 c-steps, but this transformation is not applied to ensure fair comparisons with other systems. CPU times varied between two and six minutes on a Xerox 1108, a Lisp machine in the medium-low performance range.
The second row was obtained by taking the FDS allocations for 17, 19 and 21 c-steps, setting these as a maximum FU limit and running the FDLS algorithm to obtain the shortest execution time. For the 17 and 21 c-step allocations the results were already optimal, so the time could not be reduced. However, for the 19 c-step allocation (2 adders and 2 multipliers), the FDLS algorithm produced a schedule requiring one c-step less. This is also an optimal result with respect to functional unit cost. CPU times were significantly faster than those for the FDS algorithm and varied between one and two minutes. The improvement of this result by the FDLS algorithm is mostly due to the fact that we have given more information about the design - i.e. the number and type of FUs - than in the case of the FDS algorithm, where only a time constraint is given. The number of force calculations in FDLS is also smaller, which explains the reduced CPU times. The optimal 18 c-step schedule was also obtained by researchers at the University of Eindhoven [Stok88] and Karlsruhe [Kram88]. The former use a slightly modified version of the FDS algorithm where only a subset of the force calculations are performed. The recent scheduling algorithm of Karlsruhe's CADDY system uses time frames and distribution graphs as in FDS, but incorporates a slightly different force function.

Algorithm \ Time     17 c-steps   18 c-steps   19 c-steps   21 c-steps
FDS [Paul87]         3+, 3x       3+, 2x       2+, 2x       2+, 1x
FDLS                 3+, 3x       2+, 2x       2+, 2x       2+, 1x
ASAP [KuWK85]
LS (SAW) [Thom88]
FDS, FDLS            3+, 2xP      3+, 1xP      2+, 1xP
(+: adder, x: multiplier, xP: piped multiplier)
Table 2.1  FU allocations for different execution times.

The third row represents the schedule obtained in [KuWK85] using ASAP scheduling with conditional deferment. The fourth row represents the result obtained from the 'System Architect's Workbench' (SAW) at CMU [Thom88] using a list scheduling (LS) algorithm. Finally, the fifth row represents the allocations obtained by HAL using a two-stage pipelined multiplier. A simple extension [Paul89] of the force algorithm is used to take this type of pipelining into account. The FDS and FDLS algorithms both obtained optimal results with respect to FU costs. As we will see in Section 4, register and interconnect costs compare favorably with the results of other systems.

2.4 A New Design-Space Exploration Technique
Taken alone, the new FDLS algorithm allows the user to partially specify a target architecture by setting the number and type of functional units, as well as limits on the total register and bus counts. This flexibility, added to the algorithm's effectiveness, justifies the relatively small effort required to implement it. However, from these and other experiments, we have found that the most powerful method of exploring the design space is to make use of both the FDS and FDLS algorithms. In a first phase, the designer sets a maximum time constraint and uses the FDS algorithm to arrive at a near-optimal allocation. This takes advantage of the fact that the FDS algorithm automatically performs tradeoffs between functional units of different types and costs [Paul87, Paul89]. In a second phase, the designer can then focus on that area of the design space by using the FDLS algorithm with the resulting allocation to determine if a faster schedule can be obtained. The reason for this improvement lies in the fact that the scheduler starts out with more information about the design. Regardless of the method chosen, we have given the designer an added level of flexibility with an integrated scheduling methodology that allows him to explore the design space along two dimensions: area or time. This also fulfills our original intention: to provide general algorithms that can be tailored to specific applications. Better still (from the implementer's point of view at least), most of the subroutines are common to both the FDS and FDLS algorithms.

3. Register and Interconnect Allocation
Once scheduling and functional unit allocation are completed, the data path allocation can be performed. Two of the most important subtasks are register and interconnect allocation. In the HAL system, they follow the three transformation steps [Paul86] summarized below. The emphasis throughout the process is on the minimization of interconnect costs as represented by multiplexer and bus areas. This emphasis is justified by McFarland's experiences [McFa87], which show that multiplexing costs seem to have the most significant effect on the overall cost-speed tradeoff curve.
1. Operation to FU binding: All arithmetic and logic operations are bound to specific FUs using a functional partitioning method described in [Paul88].
2. Storage to register binding: A storage operation is created for each data transfer that crosses a c-step boundary. A novel technique [Paul88] used here consists in dividing the variable lifetime into two intervals. The first interval lasts one c-step and is assigned to a 'local' storage operation. The remaining c-steps of the lifetime are assigned to the second storage operation. Typically, the two storage operations are assigned to the same register; however, there are many cases where assignment to different registers will result in lower interconnect costs. This is particularly true when a register merging method such as the one described below is used. Initially, all storage operations are assigned to separate registers (a small sketch of this splitting is given after this list).
3. Data-transfer to interconnect binding: Here a temporary binding is performed by creating muxes and connecting them to the input of every register and FU. The muxes are used to form a transfer path to each of their input source objects. Single-input muxes are preserved as they might be merged with others to form a bus in a later step.
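As an illustration of the lifetime splitting performed in step 2, the sketch below divides a value's lifetime into a one-c-step 'local' storage operation and a second storage operation covering the remaining c-steps, each initially bound to its own register. The data structures and register names are hypothetical, not the actual HAL representation.

```python
# Illustrative sketch of the lifetime-splitting idea in step 2
# (hypothetical data structures and register names).

from dataclasses import dataclass

@dataclass
class StorageOp:
    value: str        # name of the value being stored
    first_cstep: int  # first c-step of the interval (inclusive)
    last_cstep: int   # last c-step of the interval (inclusive)
    register: str     # initially, every storage op gets its own register

def split_lifetime(value, birth_cstep, death_cstep):
    """Split a value's lifetime into a one-c-step 'local' storage operation
    and a second storage operation covering the remaining c-steps."""
    local = StorageOp(value, birth_cstep, birth_cstep, f"R_{value}_local")
    if death_cstep > birth_cstep:
        rest = StorageOp(value, birth_cstep + 1, death_cstep, f"R_{value}_rest")
        return [local, rest]
    return [local]

# Example: a value produced in c-step 2 and consumed in c-step 5.
print(split_lifetime("v1", 2, 5))
```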
3.1 Register Merging
In this important optimization step, compatible registers are selectively merged. To illustrate the difficulty of the register merging problem, we will use the DiffEq example presented earlier. The first step is to determine, for each register defined, the set of disjoint registers. This yields the register compatibility graph of Fig. 3.2 shown further on, where each edge represents the fact that two registers are disjoint and could possibly be merged. Exhaustive clique partitioning could be used to generate all possible register groupings. However, this is an NP-complete problem, which forces us to explore other avenues.
Left-edge algorithm: As presented by researchers at USC [Kurd87], one possible solution is to exploit the left-edge algorithm which, in this context, guarantees the minimum number of registers.
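As a minimal sketch of this first step, the code below builds the compatibility edges from hypothetical register lifetimes, treating two registers as disjoint (and therefore mergeable) when the sets of c-steps in which they hold live values do not intersect. The register names and lifetimes are illustrative, not the DiffEq registers of Fig. 3.2.

```python
# Minimal sketch of building the register compatibility graph.
# Each register is described by the set of c-steps in which it holds a
# live value (hypothetical data).

from itertools import combinations

def compatibility_edges(lifetimes):
    """Two registers are compatible (and may be merged) when their
    sets of occupied c-steps are disjoint."""
    return [(a, b) for a, b in combinations(sorted(lifetimes), 2)
            if lifetimes[a].isdisjoint(lifetimes[b])]

if __name__ == "__main__":
    lifetimes = {"R1": {1, 2}, "R2": {3, 4}, "R3": {2, 3}, "R4": {4}}
    print(compatibility_edges(lifetimes))
    # -> [('R1', 'R2'), ('R1', 'R4'), ('R3', 'R4')]
```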
Heuristic clique partitioning: The Facet system [Tsen86] resorts to a special form of clique partitioning to determine a near-minimum number of registers. It incorporates heuristics based on the clique graph structure to prune the graph, thereby reducing the number of possible cliques.
The only limitation of these two approaches is that the repercussions of a specific register merging on interconnect costs are ignored. An earlier version of the HAL system [Paul86] attempted to take interconnect into account indirectly by favoring merges of registers connected to the same functional units. Pfahler [Pfah87] exploits a similar approach. In the subsection that follows, we present a more powerful generalization of this type of technique.
Weight-directed clique partitioning: The current HAL system exploits a stepwise refinement merging approach which involves exhaustive clique partitioning of reduced compatibility graphs. Here, structural weights similar to the ones used for interconnect allocation in [Midw88] are used to prune the graph. These weights can be determined from the preliminary FU, mux and interconnect bindings performed earlier. Register merges that favor low interconnect costs are given the highest weight (footnote 2), as depicted in Fig. 3.1.

Fig. 3.1  Interconnect weights of different register merges (weights 4 down to 1).

2 The weight values (1 to 4) are given here for illustrative purposes only. The actual values represent an estimation of the saved interconnect area - this area being evaluated using the cost function presented in [Midw88]. The weights can therefore be positive or negative.

By setting the weight threshold high enough, we can limit the complexity of the clique graph at will. For example, the application of a weight threshold of 4 to the clique graph of Fig. 3.2 yields the reduced graph represented by the dotted edges. We may then perform exhaustive clique partitioning on the reduced graph to generate all possible merges. For each of these, we evaluate the associated interconnect costs and select the one with the lowest combined register and interconnect cost. For the DiffEq example, the register groups chosen are: ( (R20, R21), (R17, R25), (R16, R23), R18, R19, R22 ). As the number of compatible register pairs decreases with each merge, the process is repeated with progressively lower thresholds until no more merges are possible. By lowering the threshold to 3, we obtain the final solution, which is a partition made up of five register groups: ( (R20, R21), (R17, R24, R25), (R16, R23), (R18, R19), R22 ). In this case, this is the minimum number of registers attainable. Perhaps more importantly, and as the experimental results will confirm, this is a configuration with an extremely low interconnect cost.

Fig. 3.2  Compatibility graph for the DiffEq example (edges with weight >= 4 are shown dotted).
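The sketch below illustrates the pruning step on a hypothetical compatibility graph: edges below the weight threshold are discarded, and registers are then merged along the remaining edges. For brevity the merge itself is greedy (heaviest edge first) rather than the exhaustive clique partitioning of the reduced graph described above, and the weights are invented for the example.

```python
# Sketch of weight-directed pruning of the register compatibility graph.
# Registers, compatibilities and weights are hypothetical. The merge step
# is greedy here, not the exhaustive clique partitioning used by HAL.

def prune(edges, threshold):
    """Keep only compatibility edges whose interconnect weight >= threshold."""
    return {pair: w for pair, w in edges.items() if w >= threshold}

def greedy_merge(registers, edges, threshold):
    groups = {r: {r} for r in registers}          # each register starts alone
    for (a, b), _w in sorted(prune(edges, threshold).items(),
                             key=lambda kv: -kv[1]):
        ga, gb = groups[a], groups[b]
        if ga is gb:
            continue
        # Two groups may merge only if every register pair across them
        # is still compatible (has an edge in the original graph).
        if all(tuple(sorted((x, y))) in edges for x in ga for y in gb):
            merged = ga | gb
            for r in merged:
                groups[r] = merged
    return {frozenset(g) for g in groups.values()}

if __name__ == "__main__":
    regs = ["R16", "R17", "R23", "R25"]           # names reused for flavor only
    edges = {("R16", "R23"): 4, ("R17", "R25"): 4,
             ("R16", "R25"): 2, ("R17", "R23"): 1}
    print(greedy_merge(regs, edges, threshold=4))
```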
3.2 Multiplexer Merging
The problem of merging muxes (a data transfer element with multiple inputs and a single output) into buses (with multiple inputs and outputs) is relatively close to the register merging problem. One important difference is that mux usage times are discontinuous; a mux created in the method described above is assigned to a series of c-steps that are not necessarily contiguous. Algorithms like the left-edge cannot be used for this reason. A clique partitioning method can be used, however. We have elected to use an approach similar to the one used for register merging. To limit complexity, in this case we resort to a threshold on the number of common inputs between mux pairs instead of the interconnect weights defined for the registers. In this approach, a merge cannot create more than two levels of buses and/or muxes for each register-FU-register transfer path. This ensures minimum delay through the interconnect paths. The Splicer [Pang88] and SAW [Thom88] systems allow up to four levels of buses/muxes.
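A minimal sketch of the mux-pair criterion is shown below, assuming (as for registers) that two muxes are compatible when their usage c-steps are disjoint, and using a threshold on the number of common input sources to decide whether a merge is worthwhile. The mux descriptions are hypothetical.

```python
# Sketch of the common-input criterion for merging muxes into buses.
# Mux descriptions (input sources and usage c-steps) are hypothetical.

def mergeable(mux_a, mux_b, min_common_inputs):
    """Two muxes may be merged into a bus when their usage c-steps are
    disjoint (they never transfer data at the same time) and they share
    at least `min_common_inputs` input sources."""
    disjoint_use = mux_a["csteps"].isdisjoint(mux_b["csteps"])
    common = len(mux_a["inputs"] & mux_b["inputs"])
    return disjoint_use and common >= min_common_inputs

if __name__ == "__main__":
    m1 = {"inputs": {"ALU1", "R3", "R7"}, "csteps": {1, 3}}
    m2 = {"inputs": {"ALU1", "R3", "R9"}, "csteps": {2, 4}}
    print(mergeable(m1, m2, min_common_inputs=2))   # True
```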
3.3 Discussion
The datapaths of Fig. 4.2 and Fig. 4.3, given in the next section, show that in addition to achieving low register and interconnection costs, we have also achieved a good structural partitioning of the design. This results from the use of interconnect information to prune the design space, because highly connected elements are grouped implicitly. Furthermore, although the two merging algorithms presented are aimed at a general distributed architecture, it is relatively simple to refine them for specific applications. Different weights can be introduced to enforce predefined structural or physical partitions corresponding to a specific architecture. Registers (muxes) within the same partition would be given the highest weight, so that the algorithm would merge these first. Varying the value of the weight allows for different compromises between the reduction of the total number of interconnect lines and the preservation of the partitions.
4. Experimental Results
The two examples presented in this section were chosen to allow comparison with the results obtained from other systems. The results, originally presented in [Paul88], were obtained without any fine tuning of the algorithms to the examples. The CPU (Xerox 1108) execution times are for the complete synthesis, which includes scheduling, FU allocation, as well as register and bus binding.
4.1 Differential Equation Example
The DiffEq example depicted earlier was first presented in [Paul86] and used subsequently in [Gebo87], [Brew87], [Pang88] and [Paul87]. The summary of costs for these and the HAL system is given in Fig. 4.1. In the first four columns, non-pipelined functional units are assumed. The first column in the figure corresponds to an early version of the HAL system [Paul86]. This result was improved on by the Splicer [Pang88] and CATREE [Gebo87] systems, as indicated in the second and third columns. The fourth column represents the result obtained in the current HAL system using the register and bus merging algorithms described above. Interconnect, register and FU cost percentages (given below the graph) are given relative to the initial result of the early HAL system. The table shows that the interconnect costs have been reduced while keeping the total register count to a minimum.
Fig. 4.1  Cost summary for DiffEq example (interconnect, register and FU area costs at time = 400 ns, with CPU times, for HAL '86, Splicer, Catree and HAL '88).

The use of a two-stage pipelined multiplier in the Splicer system [Brew87] allowed for a significant FU cost reduction, as shown in the fifth column. This is also true for the HAL result (sixth column), but here the combined use of the force-directed scheduling and weight-directed clique partitioning algorithms led to a solution with extremely low interconnect and register costs, as demonstrated in Fig. 4.2.

Fig. 4.2  HAL datapath for the DiffEq example (using a piped multiplier).

4.2 Fifth-Order Elliptic Wave Filter
The CDFG used in this example is borrowed from Kung, Whitehouse and Kailath's book on signal processing [KuWK85]. As mentioned earlier, it was chosen as a benchmark for the "1988 High-Level Synthesis Workshop" [DeBo88]. In Table 4.1, the HAL system designs are compared with those of the Catree [Gebo88], Splicer [Pang88] and the CMU SAW [Thom88] systems. The table includes the number of mux inputs required (footnote 3) - a crude measure often used to evaluate relative interconnect costs. To help isolate the effect of the different register and interconnect allocation strategies, we compare results with identical time constraints and functional unit allocations. For all these examples, and with all other costs being equal, the interconnect costs are significantly lower in the HAL system. The total CPU time varied between two and eight minutes. The bottom row indicates the overall best result obtained from HAL. It exploits a two-stage pipelined multiplier and has the lowest interconnect cost of all the examples. The datapath for this result is given in Fig. 4.3. The right operand of the pipelined multiplier is a small constant ROM that contains the filter coefficients.

System     Time (c-steps)   No. mult., No. adders   No. registers   No. mux inputs
HAL        17               2xP, 3+                 12
CATREE     17               2xP, 3+                 12              38
HAL        19               2x, 2+                  12
SAW        19               2x, 2+                  12              34
HAL        21               1x, 2+                  12
Splicer    21               1x, 2+                  n/a             35
HAL        19               1xP, 2+                 12
(+: adder, x: multiplier, xP: piped multiplier)
Table 4.1  Comparison of register and interconnect requirements.

3 This value is actually the combined number of inputs to muxes and buses, where a bus is considered equivalent to a mux with multiple outputs.

Fig. 4.3  Data path for the wave filter example.

There are three observations that can be made from the data path depicted in Fig. 4.3:
1. A relatively small number of buses (6) were used. These buses have mostly local connections to and from them.
2. As mentioned earlier, a single register-FU-register path never crosses more than two levels of muxes and/or buses. This helps reduce the clock cycle time. The transfers in the SAW and Splicer designs cross up to four levels.
3. As for the previous example, most of the interconnections are local to the area defined by a single functional unit. The bipartition in SAW is more clearly defined, however, and probably would be easier to lay out.
5. Conclusion
We have described new algorithms for two important tasks in high-level synthesis, namely scheduling under resource constraints and bus and register allocation to minimize interconnect costs. The force-directed list scheduling algorithm described is based on a well-known list scheduling algorithm but uses a more global force metric as the priority function. Furthermore, the combined use of both the FDS and FDLS algorithms supports a flexible stepwise refinement approach to design space exploration. Results from the two algorithms, as well as those obtained from other systems which exploit the same principles, clearly illustrate the effectiveness of the use of time frames, distribution graphs and concurrency balancing - the foundations of the scheduling methodology advocated here.
The register and bus allocation approach presented exploits a simple but powerful weight-directed clique partitioning algorithm based on interconnect affinities. This allowed us to prune the exploration space while favoring a reduction of interconnect costs through an implicit structural partitioning.
Finally, we have emphasized the possibility of exploiting these algorithms in more specialized synthesis environments. This emphasis hinges on the successes of systems like Cathedral-II [DeMa86] and YSC [Camp87] that are geared to a relatively narrow range of applications.
Future Work
There are still some issues that need to be addressed in the near future. The first that comes to mind is the necessity to incorporate preliminary floorplanning information into the synthesis process. Fortunately, this can be achieved using a mechanism similar to the one described in Section 3.3 - i.e. assigning higher weights to merges of registers (buses) that are in the same floorplan partition. On the other hand, the whole issue of control costs [Sauc87] has been largely ignored. Although balancing the concurrency of data path events will tend to reduce the number of control lines, and minimizing the schedule length usually reduces the controller size, more accurate metrics for control cost still have to be developed.
Acknowledgements
I would like to thank Jenny Midwinter for her insight and advice with respect to interconnect allocation and Emil Girczyc, who helped lay the groundwork of the original FDS algorithm. This research was funded in part by grants from NSERC and from BNR, Ottawa.
6. References
[Brew87] F.D. Brewer, D.D. Gajski, "Knowledge-Based Control in Micro-Architecture Design", Proc. of 24th Design Automation Conference, July 1987, pp. 203-209.
[Camp87] R. Camposano, "Structural Synthesis in the Yorktown Silicon Compiler", VLSI '87, Aug. 1987, p. 29.
[Clou87] R. Cloutier, Private Communication, Nov. 1987.
[Davi81] S. Davidson et al., "Some Experiments in Local Microcode Compaction for Horizontal Machines", IEEE Trans. on Computers, C-30, 7, Jul. 1981, pp. 460-477.
[DeBo88] E. Detjens, G. Borriello (Chairs), "Workshop on High-Level Synthesis", Orcas Island, Jan. 1988.
[DeMa86] H. De Man et al., "Cathedral-II: A Silicon Compiler for Digital Signal Processing", IEEE Design & Test Magazine, December 1986, pp. 13-25.
[Fuhr88] T. Fuhrman, "High-Level Synthesis Design of a Real-Time Control Chip at GM-Delco", ACM/IEEE High-Level Synthesis Workshop, Jan. 1988.
[Gebo87] C.H. Gebotys, M.I. Elmasry, "A VLSI Methodology with Testability Constraints", Proc. of Canadian Conference on VLSI, Winnipeg, Oct. 1987.
[Gebo88] C.H. Gebotys, M.I. Elmasry, "VLSI Design Synthesis with Testability", Proc. of the 25th Design Automation Conference, June 1988, pp. 16-21.
[GiKn84] E.F. Girczyc and J.P. Knight, "An ADA to Standard Cell Hardware Compiler Based on Graph Grammars and Scheduling", Proc. of ICCD, Oct. 1984, pp. 726-731.
[Kram88] H. Kramer et al., "Data Path and Control Synthesis in the CADDY System", Proc. of Intl. Workshop on Silicon Compilers, Grenoble, France, May 1988.
[Kurd87] F.J. Kurdahi and A.C. Parker, "REAL: A Program for Register Allocation", Proc. of the 24th DAC, Miami, July 1987, pp. 210-215.
[KuWK85] S.Y. Kung, H.J. Whitehouse, T. Kailath, "VLSI and Modern Signal Processing", Prentice Hall, 1985, pp. 258-264.
[McFa87] M.C. McFarland, "Reevaluating the Design Space for Register-Transfer Hardware Synthesis", Proc. of ICCAD, Nov. 1987, pp. 262-265.
[McPC88] M.C. McFarland, A.C. Parker, R. Camposano, "Tutorial on High-Level Synthesis", Proc. of the 25th Design Automation Conference, July 1988, pp. 330-336.
[Midw88] J. Midwinter, "Improving Interconnect for the Behavioral Synthesis of ASICs", M.Sc. Thesis, Carleton University, April 1988.
[PaGa86] B.M. Pangrle, D.D. Gajski, "State Synthesis and Connectivity Binding for Microarchitecture Compilation", Proc. of ICCAD, Nov. 1986, pp. 210-213.
[Pang88] B.M. Pangrle, "Splicer: A Heuristic Approach to Connectivity Binding", Proc. of the 25th Design Automation Conference, July 1988, pp. 536-541.
[Paul86] P.G. Paulin, J.P. Knight, E.F. Girczyc, "HAL: A Multi-Paradigm Approach to Automatic Data Path Synthesis", Proc. of 23rd DAC, July 1986, pp. 263-270.
[Paul87] P.G. Paulin, J.P. Knight, "Force-Directed Scheduling in Automatic Data Path Synthesis", Proc. of 24th DAC, Miami, July 1987, pp. 195-202.
[Paul88] P.G. Paulin, "High-Level Synthesis of Digital Circuits Using Global Scheduling and Binding Algorithms", Ph.D. thesis, Carleton Univ., Feb. 1988.
[Paul89] P.G. Paulin, J.P. Knight, "Force-Directed Scheduling for the Behavioral Synthesis of ASICs", IEEE Transactions on CAD of ICs and Systems, Vol. 8 (6), June 1989 (projected publication date).
[Pfah87] P. Pfahler, "Automated Data Path Synthesis: A Compilation Approach", Euromicro Journal of Microprogramming, Vol. 21, 1987, pp. 577-584.
[Sauc87] G. Saucier, M. Crastes de Paulet, P. Sicard, "ASYL: A Rule-Based System for Controller Synthesis", IEEE Trans. on CAD, Vol. CAD-6, Nov. 1987, pp. 1088-1097.
[Stok88] L. Stok, R. van den Born, "EASY: Multiprocessor Architecture Optimisation", Proc. of Intl. Workshop on Silicon Compilers, Grenoble, France, May 1988.
[Thom88] D.E. Thomas et al., "The System Architect's Workbench", Proc. of 25th DAC, July 1988, pp. 337-343.
[Tsen86] C. Tseng, D.P. Siewiorek, "Automated Synthesis of Data Paths in Digital Systems", IEEE Trans. on CAD of ICs and Systems, July 1986, pp. 379-395.