
Integrating Floorplanning In Data-Transfer Based High-Level Synthesis

Shantanu Tarafdar Synopsys Inc. 700 East Middlefield Rd. Mountain View, CA 94043

Miriam Leeser Zixin Yin Dept. of Electrical and Computer Engineering Northeastern University Boston, MA 02115

(Submitted to the 1998 International Conference on Computer-Aided Design)

Abstract

Modern digital systems move and process vast amounts of data. Designing good ASIC architectures for these systems requires efficient data routing and storage. A high-level synthesis (HLS) system must consider spatial aspects of the architecture it synthesizes to achieve this. In this paper, we discuss using floorplanning information in the main HLS flow. Our HLS system, Midas, uses two floorplanners and formulates HLS using a data-transfer model. Midas synthesizes architectures whose data storage and transfer subsystems are spatially integrated with their execution units. Midas also generates a high-level floorplan for the architecture, which contains the shapes and coordinates of its components and global routing channel specifications for its buses. Our experiments comparing Midas's architectures to those generated by an HLS system that uses neither the data-transfer model nor floorplanning show that Midas's architectures are smaller and yet allow for large amounts of simultaneous data motion and storage.

1 Introduction

The interdependence between the physical design characteristics of an application-specific integrated circuit (ASIC), such as placement and routing, and higher-level design abstractions, such as the ASIC's architecture, is more critical today than ever before due to the demands placed on timing and area by modern applications. ASIC implementations are the most popular solution for the memory-intensive, high-throughput digital systems being designed today. Advances in fabrication technology are allowing these systems to contain an ever increasing number of transistors, making manual management of the design complexity untenable. High-level synthesis (HLS) tools automatically synthesize architectures for an ASIC from a description of its behavior. They allow ASIC designers to focus on the functionality of an application, a more manageable level of abstraction, instead of on the details of implementing the architecture. They also allow for a more thorough and rapid architectural design space exploration than would be possible manually.

In this paper, we address the importance of physical design characteristics in architecture synthesis. Existing HLS systems tend to ignore the spatial implications of the architectures they synthesize. Areas are estimated as a linear combination of the areas of the individual components of the architecture. Interconnect area is optimized by minimizing the number of buses and treating all data transfers as equal. For modern ASICs, in which interconnect dominates the area, an HLS system can no longer ignore the spatial characteristics of the synthesized architecture.

We present an HLS methodology that incorporates floorplanning information in the core HLS flow. Our HLS system, Midas, uses the data-transfer (DT) model [1, 2], which formulates HLS around data transfers in the behavior and which allows spatial effects to be integrated in the main HLS flow. Midas is unique in its use of floorplanning information in its core HLS steps. Two floorplanners, a global one and an incremental one, provide Midas with the coordinates and shapes of the components in the architecture as well as global routing specifications for its buses very early in the HLS flow. Midas builds a high-level floorplan at the same time as it synthesizes the architecture. The result is an architecture whose data storage and transfer subsystems show a tight spatial integration with its execution unit. Our experiments show that this methodology yields architectures with smaller floorplans than a methodology that ignores floorplanning.

We begin by reviewing some previous work aimed at incorporating floorplanning into the HLS flow. We then examine how the DT-model provides a framework for including spatial effects in synthesis, how it raises the status of data issues in HLS, and how Midas uses the DT-model and floorplanning information while synthesizing architectures. Finally, we present our experimental results.

2 Past work

While most HLS systems ignore floorplanning aspects of the architectures they synthesize, there are some notable exceptions. Kurdahi et al. [3] discuss the over-simplistic models for ASIC area and timing that HLS tools are forced to use due to the lack of physical design information during architecture synthesis. They propose a layout predictive model for use in high-level design automation tools which accounts for a variety of register transfer level design styles and uses floorplanning.

In BUD [4], hierarchical clustering of operations in the design is done. Each cluster is scheduled and allocated. The clusters are then floorplanned using a fast floorplanner. BUD explores several design alternatives and picks the one with the smallest floorplan.

The IBA (interleaved binder/allocator) HLS system uses a pair of modules called LE and Fasolt [5]. LE performs a more detailed floorplanning than BUD does; the author calls it a compromise between floorplanning and macro-cell placement and routing. Fasolt analyzes the output of LE, searches for certain patterns and, if it finds them, performs optimizing transformations on the architecture.

3D [6] builds up the schedule incrementally. After the operations are scheduled and bound, floorplanning is done. Wiring delays are computed based on the floorplan. In a post-processing phase, 3D examines pairs of operations that use the same functional unit type to see if swapping the functional units they are bound to reduces wiring delay, and if so, the swap is performed. 3D also tests if adding redundant functional units can improve wiring delay without greatly affecting the ASIC area.

BINET [7] performs incremental binding with floorplanning on a scheduled DFG. The floorplanner that is used is based on the Mason floorplanner. BINET solves a network flow problem for binding at each clock cycle of the architecture and then updates the floorplan. The partial floorplan is used in formulating the network flow problem for the next clock cycle. This continues until binding has been performed for all clock cycles.

Moshnyaga and Tamaru [8] describe an HLS system which integrates floorplanning information more deeply in the main HLS flow. Their approach clusters operations based on loop hierarchy and repetitive functional patterns. The clusters are assigned hardware which is then floorplanned. Scheduling follows floorplanning. The system then evaluates whether a predefined set of moves, if applied to the synthesized architecture, will improve it. If so, the moves are made, the clusters are redefined, and architectural synthesis is done over again.

In contrast to the above approaches, Midas tightly integrates the utilization of floorplanning information in the main HLS flow. Floorplanning is not a postprocessing option. Midas uses floorplanning to reflect the cost in silicon utilization of data-transfer, to measure spatial data-transfer congestion, and to project more accurate area estimates for the architecture being synthesized. These design metrics guide all aspects of HLS in Midas: scheduling, allocation, and binding.

3 The DT-model

Midas aims at elevating the status of data issues in HLS [1, 2]. It does so by formulating the classic HLS subproblems like scheduling and binding in terms of data-transfers (DTs). In an architecture, data that needs to be stored must be bound to storage elements. Data transfers must be bound to terminals of functional units, ports of storage elements, and interconnect. Typically, these issues are handled as a postprocessing step after the operations in the behavior have been scheduled and bound, but there are advantages to handling them simultaneously, especially when floorplanning information is available. The DT-model in Midas provides a framework in which to do this.

In HLS, a data flow graph (DFG) is extracted from the input behavioral description. The nodes of the DFG represent operations and the directed edges represent data dependencies between operations. Operations are traditionally the basic entity in HLS. Scheduling and binding are formulated in terms of operations. Midas raises the complexity of the basic entity to include data. A data-transfer (DT), in Midas, is a set of operations corresponding to the movement of a single instance of data. It contains the operation sourcing the data and all the operations using the data. In an unscheduled DFG, the operation represented by each node with outgoing edges is the source operation of a primary DT and the operations of the node's children are the destination operations of the DT. An operation can be a member of more than one DT. Figure 1 shows DTs extracted from a DFG.
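The extraction rule above can be illustrated with a minimal sketch. It assumes the DFG is given as a dictionary mapping each node to its children; the function name, the data layout, and the example graph are hypothetical and are not taken from Midas or from Figure 1.

```python
def extract_primary_dts(dfg):
    """Return one primary DT per node that has outgoing edges.

    dfg maps each node name to the list of its children, i.e. the
    operations that consume the value the node produces. Each DT is
    returned as a (source, destinations) pair.
    """
    dts = []
    for source, children in dfg.items():
        if children:  # a node without outgoing edges sources no DT
            dts.append((source, tuple(children)))
    return dts


# Hypothetical DFG: n1 feeds n2 and n3, and both of those feed n4.
example_dfg = {
    "n1": ["n2", "n3"],
    "n2": ["n4"],
    "n3": ["n4"],
    "n4": [],
}

primary_dts = extract_primary_dts(example_dfg)

# n4 is a destination of two different DTs, which shows that an
# operation can be a member of more than one DT.
sources_feeding_n4 = [src for src, dests in primary_dts if "n4" in dests]
```

Here `n2` is both a destination of the DT sourced by `n1` and the source of its own DT, so DT membership overlaps by construction.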

Figure 1: Extracting DTs from a DFG. (a) DFG; (b) Extracted DTs.

Data transfers require interconnect in the floorplan. The cost of transferring data can be modelled as the product of the area of the interconnect needed and the length of time the interconnect has to be reserved for the transfer. This is the DT-cost that Midas associates with every DT. The sum of the DT-costs of DTs scheduled in the same control step (cstep) is the cumulative DT-cost at the cstep and reflects the interconnect requirement of the architecture. A schedule in which each cstep has roughly the same cumulative DT-cost is one that efficiently utilizes the available interconnect in the ASIC.
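As a small worked example of these definitions, the sketch below computes a DT-cost as interconnect area times reservation time and sums the costs per cstep. The tuple layout, the function names, and all numbers are illustrative assumptions, and each DT's cost is attributed to the cstep in which it is scheduled.

```python
from collections import defaultdict


def dt_cost(interconnect_area, reserved_csteps):
    """DT-cost: interconnect area times the time it is reserved."""
    return interconnect_area * reserved_csteps


def cumulative_dt_costs(scheduled_dts):
    """Sum the DT-costs of all DTs scheduled in each cstep.

    scheduled_dts is an iterable of (cstep, interconnect_area,
    reserved_csteps) tuples; this layout is an assumption made for
    the example.
    """
    totals = defaultdict(int)
    for cstep, area, duration in scheduled_dts:
        totals[cstep] += dt_cost(area, duration)
    return dict(totals)


# Hypothetical schedule with four DTs spread over three csteps.
schedule = [(1, 40, 1), (1, 25, 1), (2, 60, 1), (3, 30, 2)]
costs = cumulative_dt_costs(schedule)

# A small spread between csteps indicates evenly used interconnect.
imbalance = max(costs.values()) - min(costs.values())
```

For this schedule the cumulative DT-costs are 65, 60, and 60, so the interconnect demand is nearly level across the three csteps.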

Figure 2 is a snapshot of the floorplan of an architecture at one cstep in the execution of the behavior. The arrows are DTs and the areas of the arrows represent the relative costs of the DTs.

Figure 2: DT-costs in a cstep during execution. (Floorplan snapshot with four registers, two adders, and a multiplier, connected by data transfers dt_a through dt_f.)

The generation of storage requirements is modelled explicitly in Midas as a result of a scheduling directive that all the operations of a DT must be scheduled at the same cstep. If all the operations of a primary DT are scheduled in the same cstep, the data generated by the source is consumed immediately and does not need to be stored. Sometimes, however, it is impossible to schedule all the operations of a DT in the same cstep due to resource or timing constraints. In that case, the primary DT must be partitioned into a set of secondary DTs, each of which can meet the scheduling directive. The operations of the primary DT are distributed among the secondary DTs. When DT partitioning takes place, it implies the data represented by the primary DT must live across a cstep boundary and therefore needs to be stored in a storage element, like a register file, in the architecture. Each secondary DT generated corresponds to a read or write of the data in the storage element and has a storage read or write added to its set of operations. Figure 3 illustrates a general case of DT partitioning.
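The partitioning rule described above can be sketched as follows. The sketch assumes that each operation already has a cstep it can be scheduled in and that the source operation occupies the earliest cstep. The `cstep_of` mapping, the `ST_WR` and `ST_RD` markers, and the grouping policy are hypothetical simplifications, loosely modelled on the seven-operation DT of Figure 3 and not on Midas's actual partitioning algorithm.

```python
def partition_dt(source, destinations, cstep_of):
    """Split a primary DT into secondary DTs, one per cstep used.

    cstep_of maps every operation of the DT to a cstep (a hypothetical
    input for this sketch). The secondary DT holding the source gets a
    storage write; every other secondary DT gets a storage read.
    """
    groups = {}
    for op in [source, *destinations]:
        groups.setdefault(cstep_of[op], []).append(op)

    if len(groups) == 1:
        # All operations share one cstep: the data is consumed
        # immediately, so no storage access is added.
        return [next(iter(groups.values()))]

    secondary = []
    for cstep in sorted(groups):
        ops = groups[cstep]
        if source in ops:
            secondary.append(ops + ["ST_WR"])   # data written to storage
        else:
            secondary.append(["ST_RD"] + ops)   # data read back later
    return secondary


# Hypothetical cstep assignment for a DT with source n1.
cstep_of = {"n1": 1, "n2": 1, "n3": 1, "n4": 2, "n5": 2, "n6": 3, "n7": 3}
parts = partition_dt("n1", ["n2", "n3", "n4", "n5", "n6", "n7"], cstep_of)
unsplit = partition_dt("n1", ["n2"], {"n1": 1, "n2": 1})
```

Each extra secondary DT adds one storage access and one transfer, which is why the number of groups directly affects port counts and bus counts.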

Figure 3: A partition of a primary DT. (a) Original DT: DT1 with operations n1 through n7. (b) DT Partition: secondary DTs DT1a, DT1b, and DT1c, with a storage write (ST WR) and storage reads (ST RD) added.

DT partitioning directly impacts storage and interconnect requirements. The action of partitioning gives rise to a storage requirement for the data. This may increase the size of storage. Each of the secondary DTs formed by partitioning a DT gives rise to a storage access requirement, which can increase the number of ports on storage elements, and an interconnect requirement, which can increase the number of buses in the synthesized architecture. In general, DT partitioning is to be avoided since it has the potential to increase storage size, storage bandwidth, and number of buses. When it must be done, minimizing the number of secondary DTs generated will reduce the storage element bandwidths and number of buses.

DT partitioning is a core HLS step in Midas since DTs that cannot meet the scheduling directive must be partitioned before they can be scheduled. Therefore, Midas has to consider storage and interconnect in its main HLS flow. Since Midas has floorplanning information available, it can make more accurate projections regarding the impact of DT partitioning on the area of the ASIC. DTs that will have to be scheduled in csteps with already high cumulative DT-costs can be selectively partitioned to allow the secondary DTs generated to be scheduled in csteps with lower cumulative DT-costs. This allows for a more efficient use of the interconnect allocation in the architecture. For a similar reason, DT partitioning need not generate the minimum possible number of secondary DTs.

The scheduling directive makes a DT correspond to an actual data transfer that will occur in the synthesized architecture. In addition, a DT neatly encapsulates data broadcasting. This allows Midas to make more accurate predictions regarding actual data accesses and transfers in the schedule, which helps to guide Midas's synthesis algorithms. This is difficult to do in the conventional operation-based model for HLS.

DT binding is also tightly coupled with floorplanning and includes operation binding and data binding that are typically decoupled in an operation-based model. When a DT is bound, its operations are bound to functional units and the data itself to terminals on those functional units. If the DT is a secondary one, then the data must be bound to a storage element and the storage read or write to a port of the storage element. Finally, the transfer of data must be bound to a bus in the architecture. Binding a DT completely specifies the route of data in the architecture at a structural level. In Midas, floorplanning information is exploited to add a spatial component to the route. Functional units, storage elements, and interconnect are selected to minimize DT-cost, which in turn contributes to minimizing ASIC area. This means Midas prefers to bind a DT to components of the architecture that show spatial locality. This leads to short buses. The floorplanner in Midas tends to place newly allocated components so as to contribute to these same objectives.

4 Midas

Midas performs time-constrained HLS and minimizes ASIC area given an upper bound on the execution time for the behavior. It synthesizes its architectures incrementally in the body of a synthesis loop and uses deterministic algorithms. The hardware model it uses is shown in Figure 4. The execution unit is made up of functional units, the storage unit is made up of distributed, multi-ported register files, and the data-transfer subsystem is a network of buses connected to the components through multiplexers. These components are spatially integrated, which means that functional units are placed close to the register files with which they interact and buses tend to be many, but short. Midas outputs both an architecture for the behavior and a high-level floorplan for the architecture.

Figure 4: Midas's hardware model. (Datapath with a storage architecture of address generators AG1 to AG4 and multiported register files MPRF1 to MPRF4, a data-transfer subsystem of multiplexers and buses, and an execution unit of functional units such as ADDER, SUB, ABS MAG, and MULTIPLIER, together with a controller and a clock. Key: AG = address generator; MPRF = multiported register file.)

Figure 5 illustrates the synthesis flow in Midas. The DFG is first transformed into the DT-domain and then Midas enters its main synthesis loop. Each time through the loop design metrics are evaluated, DTs are partitioned, and a single DT is selected, scheduled and bound. At the end of each loop, the partial design is floorplanned.

Figure 5: HLS flow in Midas. (Input DFG; conversion to DT-model; synthesis loop: evaluate timeframes and necessary design metrics, partition DTs, evaluate all Midas design metrics, evaluate all (DT, cstep) scheduling action costs, schedule the minimum cost (DT, cstep) pair, bind DT, place DT / floorplan partial design, and repeat until all DTs are scheduled; annotate DFG with design information; output DFG.)

Midas uses two floorplanners: a global floorplanner and an incremental floorplanner. The global floorplanner completely rebuilds the floorplan. It is used at the end of each synthesis loop to generate a floorplan for the partial architecture. This floorplan is used in the next iteration through the synthesis loop in the evaluation of design metrics and as an input to the incremental floorplanner. The incremental floorplanner modifies an existing floorplan based on changes in the architecture the floorplan represents. It is used to guide the DT binding algorithm.

Floorplanning information is used by Midas to compute the projected ASIC area and DT-cost design metrics. Projected ASIC area represents an estimate of the area of the final architecture. Midas inflates the existing partial floorplan to account for projected resource utilization that is not yet part of the partial architecture. The area of the inflated floorplan is the projected ASIC area. The area component of the DT-cost of a DT that has already been bound is the area of the bus to which it is bound. The global routing specification for the bus in the partial floorplan provides this information. For an unbound DT, Midas envisages a worst-case scenario where the DT must span the floorplan. Midas uses half the perimeter of the partial floorplan as the length of the required bus for an unbound DT and uses it to project the DT-cost.

DT partitioning considers cumulative DT-cost when deciding whether to partition a DT and how to partition it. If the average cumulative DT-cost over csteps in the schedule at which a DT can be scheduled is high, the DT is partitioned to avoid DT congestion in any part of the schedule. The DT is partitioned so that the average cumulative DT-cost for each of the secondary DTs is low.

DT selection and scheduling proceeds with an evaluation of every possible scheduling action. Each DT has a timeframe in the schedule over which it is possible to schedule the DT. Midas evaluates a scheduling action cost function for every cstep in this timeframe. The DT and cstep pair whose scheduling action cost is lowest is selected for scheduling. Projected ASIC area and cumulative DT-costs are important components of the scheduling action cost function.

The binding algorithm in Midas uses a branch-and-bound heuristic to bind a DT to functional units, storage elements, terminals, ports, and a bus. The cost function guiding the branch-and-bound search is the area returned by the incremental floorplanner as a consequence of the binding actions applied to the existing floorplan. The net effect is a tight spatial integration of the execution unit, storage unit, and data-transfer subsystem of the architecture synthesized.

Resource allocation is bundled in the binding algorithm. In addition to having the option of binding a DT to available resources, Midas presents the binder with the choice to introduce a new resource. The incremental floorplanner can tell if doing so will result in a lower overall floorplan area due to a reduction in interconnect area. This enables the binder to trade redundant components for area, something that is impossible without floorplanning information.
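The selection of a scheduling action can be sketched as an exhaustive search over (DT, cstep) pairs. The cost used below, the existing cumulative DT-cost at the cstep plus the DT's own projected cost, is only an illustrative stand-in; the paper does not give Midas's scheduling action cost function. The bus width, floorplan dimensions, and all names are likewise assumptions.

```python
def projected_bus_length(floorplan_width, floorplan_height):
    """Worst-case bus length for an unbound DT: half the perimeter."""
    return floorplan_width + floorplan_height


def pick_scheduling_action(timeframes, cumulative_cost, dt_cost):
    """Return the (dt, cstep) pair with the lowest action cost.

    timeframes maps each DT to its (first, last) feasible cstep.
    The cost function here is a placeholder, not Midas's own.
    """
    best = None
    for dt, (first, last) in timeframes.items():
        for cstep in range(first, last + 1):
            cost = cumulative_cost.get(cstep, 0) + dt_cost[dt]
            if best is None or cost < best[0]:
                best = (cost, dt, cstep)
    return best[1], best[2]


bus_width = 2                              # hypothetical bus width
length = projected_bus_length(30, 20)      # partial floorplan is 30 x 20
dt_cost = {
    "DT1": length * bus_width,             # unbound: worst-case projection
    "DT2": 40,                             # bound: area of its actual bus
}
timeframes = {"DT1": (1, 2), "DT2": (2, 3)}
cumulative = {1: 120, 2: 10, 3: 0}         # costs already in the schedule
action = pick_scheduling_action(timeframes, cumulative, dt_cost)
```

With these numbers the bound DT wins because its known bus is much cheaper than the worst-case projection for the unbound one, and it lands in the least congested cstep of its timeframe.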

5 Global floorplanner

The global floorplanner is based on McFarland's Fast Floorplanner [9]. The inputs to the floorplanner are a list of blocks in the architecture, a list of possible shapes for each block, a list of buses in the architecture, and a list of blocks that each bus is connected to. A block can be a functional unit or a register file. Multiplexers in the architecture are assumed to be bundled into the shape list for the block to which they are connected. The floorplanner outputs the coordinates and selected shape of each block and a routing specification for each bus.

The global floorplanner uses a slicing-tree mechanism. The set of blocks is bipartitioned recursively until each set contains only one block. The sets formed in the bipartitioning sequence form the nodes of the slicing tree. Each of the sets formed during recursive bipartitioning corresponds to a section of the ASIC floorplan. The two sets formed by any bipartitioning action are said to lie on opposite sides of a "cut" on the floorplan. Figure 6 illustrates a slicing tree and its relation to a floorplan.

Figure 6: The relation of a slicing tree to the floorplan. (a) Slicing tree: Cut1 (vertical) splits [A,B,C,D,E] into [A,B] and [C,D,E]; Cut2 (horizontal) splits [A,B] into [A] and [B]; Cut3 (horizontal) splits [C,D,E] into [C,D] and [E]; Cut4 (vertical) splits [C,D] into [C] and [D]. (b) Relative position of blocks in floorplan.

The floorplanner uses the Kernighan-Lin [10] bipartitioning algorithm and minimizes the number of buses that connect blocks on the two sides of a cut. This has the effect of placing interconnected blocks close to one another on the floorplan. For each cut, the floorplanner decides which buses must be broadcast over the cut based on whether the bus is entirely contained on one side of the cut or not. Buses that must be broadcast are allocated global routing channels.

A shape is selected for each block in the floorplan so as to minimize the overall area of the floorplan. This is done by propagating combinations of possible shapes for the blocks upwards from the leaves of the slicing tree [9]. When the possible shapes of two sides of a cut are combined, additional area is added to the combined shapes to account for global routing channel allocations at the cut. Once the shapes have been computed, the floorplanner traverses the tree from top to bottom, computing the coordinates of each block and each routing channel. The result is block placement and bus placement in the form of the shape and coordinates of each block and the global routing channel specifications for each bus. Figure 7 shows a possible floorplan for the slicing tree in Figure 6.

Figure 7: A sample floorplan. (Blocks A to E with global routing channels for bus1 to bus6.)

The global floorplanner is called at the end of Midas's synthesis loop and is used to update the partial floorplan for changes made in the architecture during an iteration of the synthesis loop. The floorplan generated by the global floorplanner at the end of the last iteration of the synthesis loop becomes the high-level floorplan output by Midas.

The partial floorplan generated by the global floorplanner is used to evaluate the design metrics that guide DT partitioning, scheduling, and binding in Midas. The area of the partial floorplan is used as the starting point in computing the projected area of the final ASIC architecture. The area of a bus in the floorplan is available from the global routing channel specification and is used as the area component of the DT-cost of DTs bound to the bus. The cumulative DT-cost is used to compute components of the hierarchical scheduling action cost function and to guide DT partitioning.

The partial floorplan is also used as an input to the incremental floorplanner that is used by the DT binding algorithm. The partial floorplan generated by the global floorplanner is integral to the HLS flow in Midas and is critical to Midas in its synthesis of architectures whose components are spatially integrated.
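The bottom-up shape propagation described in this section can be sketched as follows. The sketch assumes that a vertical cut places its two sides side by side and a horizontal cut stacks them, and it models the routing channel at a cut as a fixed extra dimension. These conventions, the function names, and the block shapes are assumptions for illustration; they are not taken from the Fast Floorplanner [9].

```python
def combine_shapes(left, right, direction, channel=0):
    """Combine the (width, height) shape lists of two sides of a cut.

    direction "vert" puts the sides next to each other, "horiz" stacks
    them. channel is the extra space reserved at the cut for global
    routing. Dominated shapes (no smaller in both dimensions than some
    other shape) are dropped.
    """
    combos = set()
    for w1, h1 in left:
        for w2, h2 in right:
            if direction == "vert":
                combos.add((w1 + w2 + channel, max(h1, h2)))
            else:
                combos.add((max(w1, w2), h1 + h2 + channel))
    return sorted(
        s for s in combos
        if not any(o != s and o[0] <= s[0] and o[1] <= s[1] for o in combos)
    )


def best_area(shapes):
    """Smallest area among the candidate shapes."""
    return min(w * h for w, h in shapes)


# Hypothetical shape lists for three blocks.
a = [(2, 6), (3, 4)]
b = [(4, 4), (2, 8)]
c = [(5, 5)]

ab = combine_shapes(a, b, "horiz", channel=1)    # a stacked on b
abc = combine_shapes(ab, c, "vert", channel=1)   # [a, b] beside c
```

The smallest shape for the pair alone is 2 x 15 (area 30), yet the smallest overall floorplan, 10 x 9, is built from the 4 x 9 combination. This is why all non-dominated shapes are propagated upwards, not only the locally smallest one.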

6 Incremental floorplanner

The incremental floorplanner modifies a floorplan generated by the global floorplanner to accommodate changes in the architecture since the last time the global floorplanner was run. Changes include newly allocated blocks and buses, changes to the shapes of blocks due to register file or multiplexer inflation, or terminals added to existing buses. The incremental floorplanner operates on the slicing tree generated by the global floorplanner. It first replaces all the blocks in the slicing tree that have been changed and updates information at all cuts to reflect any changes to the connectivity of existing buses or any new buses. Finally, each newly allocated block is placed in the slicing tree to minimize the increase in buses that cross cuts. After the slicing tree has been updated, bus broadcasting, shape generation, and coordinate computation are performed in the same way as in the global floorplanner.

The incremental floorplanner is used in Midas's DT binding algorithm to guide the branch-and-bound search. In the course of the search, binding actions are evaluated. Each binding action can change the shape functions of components in the architecture or the connectivity of buses. These changes are input to the incremental floorplanner along with the partial floorplan from the previous iteration of the synthesis loop. The area of the incremental floorplan returned guides the branch-and-bound search. The DT binding selected is the one that results in the minimum floorplan area.

The incremental floorplanner is critical to Midas's resource allocation strategy. Unlike many other HLS systems, Midas does not allocate a minimum number of resources. Instead, Midas allocates resources to minimize floorplan area. Sometimes adding extra resources can lead to lower interconnect requirements whose net effect is to reduce the overall area of the floorplan. Minimizing the number of functional units, storage elements, and buses typically leads to buses that are long, that have many terminals, and that consume large areas of the floorplan. The incremental floorplanner allows Midas to trade extra resources for a decrease in overall area.
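The greedy placement of a newly allocated block can be illustrated for a single cut. The sketch below counts the buses that cross a cut and puts the new block on whichever side yields fewer crossings. The two sides reuse the [A,B] and [C,D,E] sets of Figure 6, while the bus connectivity, the new block F, and the function names are hypothetical.

```python
def buses_crossing(side_a, side_b, buses):
    """Count buses with at least one terminal on each side of a cut."""
    return sum(
        1 for blocks in buses.values()
        if blocks & side_a and blocks & side_b
    )


def place_new_block(block, side_a, side_b, buses):
    """Greedily pick the side of the cut that gives fewer crossing buses.

    Returns the chosen side ("a" or "b") and the resulting crossing count.
    """
    cost_a = buses_crossing(side_a | {block}, side_b, buses)
    cost_b = buses_crossing(side_a, side_b | {block}, buses)
    return ("a", cost_a) if cost_a <= cost_b else ("b", cost_b)


# Hypothetical connectivity; block F is the newly allocated one.
buses = {
    "bus1": {"A", "B"},
    "bus2": {"C", "D", "E"},
    "bus3": {"B", "C"},
    "bus4": {"D", "E", "F"},
}
before = buses_crossing({"A", "B"}, {"C", "D", "E"}, buses)
side, crossings = place_new_block("F", {"A", "B"}, {"C", "D", "E"}, buses)
```

Placing F with D and E keeps bus4 on one side of the cut, so the number of crossing buses does not increase. A full implementation would repeat this decision at each cut on the way down the slicing tree.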

7 Incremental versus global floorplanning

The global and incremental floorplanners serve very different functions and are therefore designed very differently.

The global floorplanner generates the partial floorplan that is used to evaluate design metrics that guide DT partitioning, scheduling, and binding in subsequent iterations of the synthesis loop. It is important that it generates a good floorplan that converges to the final high-level floorplan. It is called only once in each iteration of the synthesis loop and, therefore, the quality of its result is more important than its speed.

The incremental floorplanner is used to provide a one-step lookahead function to guide the binding algorithm. It is called once for every node in the binder's branch-and-bound search tree. It is therefore imperative that the incremental floorplanner be fast. Since the incremental floorplanner's output is not used as the partial floorplan, its quality is not as critical as the floorplanner's speed as long as the quality is not excessively poor. The global floorplanner is called after binding has been performed to "refresh" the floorplan.

While the incremental floorplanner uses a simple greedy heuristic, we find that for the one-step lookahead it needs to perform, it is adequate. In most cases, the area of the floorplan generated by the incremental floorplanner during DT binding matches the area of the partial floorplan obtained during the subsequent run of the global floorplanner. The actual topologies of the floorplans may differ but the areas are usually very close. But this happens only when the changes input to the incremental floorplanner are few and small, as happens during one-step lookahead. If the incremental floorplanner is used to generate the complete floorplan, we see that in most cases the global floorplanner generates floorplans with much lower areas, as expected.

The combination of incremental and global floorplanners in Midas allows floorplanning to be integrated in the HLS flow without sacrificing speed or quality. The DT-model used by Midas exploits this spatial information and guides the synthesis algorithms to the spatially integrated datapath architectures Midas seeks to generate.

8 Results

We tested Midas by synthesizing architectures and floorplans for each of five input DFGs: an arithmetic expression, a biquad filter, the differential equation solver benchmark, the elliptic filter benchmark, and a motion estimator. For each input, we varied the maximum allowed length of the schedule between its minimum possible and 30 control steps. The measure of merit for a synthesized architecture was the area of its high-level floorplan.

In order to provide a control against which to evaluate Midas, we constructed an alternate HLS system as a competitor for Midas. This HLS system used the same scheduling algorithm as Midas except it did not use the DT-model or floorplanning information. Binding and floorplanning were secondary steps. Register allocation and merging minimized the number and sizes of register files needed in the architecture. Functional unit allocation minimized the number of each type of functional unit. Operation and variable binding were done with the aim of reducing the number of inputs to the multiplexers at the terminals of functional units and ports of register files in the architecture. Bus allocation and data-transfer binding were done to minimize the number of buses and the number of terminals on each bus. The final step was floorplanning, using the global floorplanner.

80000

60000 45000 Midas Competitor

40000

40000 20000

35000

0

30000

10

12

14

16

18

20

22

24

26

28

C-Steps 25000 Area

Figure 9: Results of motion estimator runs

20000

15000

14000 Midas Competitor

10000 12000 5000 10000 0 16

20

24 C-Steps

Figure 8: Results of elliptic lter runs Figure 8 shows the results of the runs on the elliptic lter benchmark. In the graph one of the lines represents the areas of the architectures synthesized by Midas. The other line corresponds to architectures synthesized by the competitor. As can be seen, Midas performs better consistently. A closer examination of the oorplans synthesized showed that Midas opted for more buses than the competitor system did, but ensured they were short ones. In addition, Midas traded a larger storage unit for shorter data-transfers. The buses in the architectures generated by the competitor system were few, but had many more terminals than the buses in Midas's designs. The sum of the areas of the storage elements was smaller than the sum of register le areas in Midas, but extra routing was required to reach them. The execution units synthesized by both systems had identical numbers of components. Midas was able to exploit spatial locality between storage and functional units while scheduling and binding. The synthesis runs on the elliptic lter benchmark clearly showed the advantages of oorplanning information used early in HLS and underlined the potential of Midas's approach. Figure 9 shows the results of runs on the motion estimator. Even though the results are not as dramatic as the ones obtained for the elliptic lter benchmark, we see Midas performs better for architectures with smaller schedule lengths. We observed this trend in architectures for all of the other DFGs. Midas consistently synthesizes smaller oorplans when the schedule length is constrained to be very short. Figure 10 shows the results of runs on the arithmetic expression example in which we see the same trend. Table 1 shows the resource counts and oorplan areas for architectures with dierent schedule lengths

that were synthesized for the elliptic filter example by both Midas and the competitor. The I/O pads, adders, and multipliers in our technology library tend to be smaller in area than the register files and interconnect that the architectures require. In most of the architectures, Midas uses a higher number of these smaller resources, but a smaller number of register files and buses. Midas's integration of floorplanning information and its elevation of data issues in HLS allow it to make this tradeoff early in the HLS flow, thereby reducing floorplan area.

Figure 10: Results of arithmetic expression DFG runs (plot of area against C-steps)

In the architectures for the first two schedule lengths, we see that the only difference in resource counts is that Midas's architectures have a larger number of buses. It seems counter-intuitive that their floorplan areas are smaller. On close examination of the floorplans, we found that the buses in Midas's architectures are short, with half of them connected to only two components. The overall effect is a reduction in floorplan area. The results indicate that the use of the DT-model and floorplanning enables Midas to generate better floorplans, especially for behaviors that must execute in short lengths of time. High-throughput, data-intensive applications, the class of applications that

motivated the DT-model and the elevation of the status of data issues in HLS, typically have tight execution time constraints. Our results show the potential of our methodology in synthesizing efficient architectures for this class of applications. In addition, the availability of floorplanning information allows Midas to route data more efficiently, thereby lowering interconnect area.

SL       I/O  Add  Mul  RF  Bus   Area
16   M    4    4    2    9   18  29412
     C    4    4    2    8   12  42912
17   M    4    4    2    7   11  17760
     C    4    3    2    7   10  21402
18   M    4    4    2    6   10  17136
     C    3    3    2    8   10  22952
19   M    4    3    2    6    9  13680
     C    3    3    2    8    9  22995
24   M    3    2    2    3    8   9165
     C    2    1    2    8    7  13860
26   M    2    2    2    4    6   8844
     C    2    2    1    7    7  12485
27   M    2    2    2    5    7  11154
     C    1    2    1    6    9  11596

Key:
M    = Midas-generated architecture
C    = Competitor-generated architecture
SL   = Schedule length in c-steps
I/O  = Number of I/O pads
Add  = Number of adders
Mul  = Number of multipliers
RF   = Number of register files
Bus  = Number of buses
Area = Area of floorplan

Table 1: Architectures for elliptic filter
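The area columns of Table 1 can be compared directly. The following Python sketch computes the relative floorplan-area reduction of the Midas architecture over the competitor's at each schedule length. The areas are copied from Table 1; the names `AREAS` and `area_reduction` are illustrative and are not part of Midas.

```python
# Floorplan areas from Table 1, keyed by schedule length (SL):
# (Midas area, competitor area). All names here are illustrative only.
AREAS = {
    16: (29412, 42912),
    17: (17760, 21402),
    18: (17136, 22952),
    19: (13680, 22995),
    24: (9165, 13860),
    26: (8844, 12485),
    27: (11154, 11596),
}


def area_reduction(midas: int, competitor: int) -> float:
    """Fraction by which the Midas floorplan is smaller than the competitor's."""
    return (competitor - midas) / competitor


# Relative reduction for every schedule length in the table.
reductions = {sl: area_reduction(m, c) for sl, (m, c) in AREAS.items()}

for sl, r in sorted(reductions.items()):
    print(f"SL {sl}: Midas floorplan is {100 * r:.1f}% smaller")
```

With these figures the Midas floorplan is smaller at every schedule length listed, by roughly 4% at SL 27 and roughly 40% at SL 19.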

9 Conclusions

We have presented a floorplanning approach to HLS in the form of Midas, an HLS system that uses data-transfers (DTs) to synthesize architectures and high-level floorplans. Midas incorporates floorplanning in the main HLS flow and makes the information that floorplanning provides integral to DT partitioning, scheduling, and binding. A global floorplanner performs slower but higher-quality floorplanning at the end of each synthesis iteration, while an incremental floorplanner provides a fast but acceptably accurate prediction of the effects of synthesis actions on the floorplan. The combination of the two allows floorplanning to be integrated into the HLS flow while still ensuring quality and speed. The resulting high-level floorplan contains the shapes and coordinates of functional units and multi-ported register files, as well as global routing channel specifications for buses in the architecture.

The integration of two types of floorplanning with scheduling and binding is unique to Midas. In addition, the DT-model used in Midas provides a framework for encapsulating data broadcasts and accesses. This, when combined with floorplanning information, helps Midas model data flow in the synthesized architecture more accurately in the early stages of synthesis.

Our results show that floorplans generated by Midas tend to be smaller than those generated by the competitor system. The architectures Midas synthesizes tend to have many short buses instead of a few long ones. They also tend to trade extra components for a reduction in overall area, made possible by the shorter data routes that the extra components allow. The floorplanning approach that Midas uses in conjunction with the DT-model results in lower-area floorplans and shows merit especially when synthesizing architectures for designs under tight schedule length constraints.
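The observation that many short buses can occupy less area than a few long ones can be illustrated with a simple wirelength estimate. The sketch below uses the half-perimeter of the bounding box around a bus's terminals as a stand-in for its routing length. This estimate, the component coordinates, and the function names are assumptions made for illustration; they are not Midas's routing-channel model.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def half_perimeter(terminals: List[Point]) -> float:
    """Half-perimeter of the bounding box enclosing a bus's terminals."""
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_length(buses: List[List[Point]]) -> float:
    """Sum of the estimated lengths of all buses in a design."""
    return sum(half_perimeter(bus) for bus in buses)


# Hypothetical placement: three storage/functional-unit pairs. The two
# components of each pair are adjacent; the pairs are far apart.
pairs = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(10.0, 0.0), (11.0, 0.0)],
    [(0.0, 8.0), (1.0, 8.0)],
]

# Design A: one shared bus with six terminals.
few_long = [[p for pair in pairs for p in pair]]
# Design B: three two-terminal buses, one per adjacent pair.
many_short = pairs

print("one six-terminal bus:    ", total_length(few_long))
print("three two-terminal buses:", total_length(many_short))
```

Under these assumed coordinates the single six-terminal bus has a larger estimated length than the three two-terminal buses combined, even though the second design has more buses. The two designs do not offer the same connectivity, so the comparison only shows that bus count alone is a poor proxy for interconnect area.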
