Datapath-oriented FPGA Mapping and Placement for Configurable Computing (Extended Abstract)
Timothy J. Callahan and John Wawrzynek, University of California, Berkeley
Widespread acceptance of FPGA-based reconfigurable coprocessors will be expedited if compilation time for FPGA configurations can be reduced to be comparable to software compilation. This research achieves this goal, generating complete datapath layouts in fractions of a second rather than hours.
Our Approach
For the reconfigurable coprocessor in the Garp chip [2] being developed in our group, we designed GAMA, the Garp Mapper. GAMA generates regular bitslice datapath layouts, optimized with multiple operations packed into a single row of the reconfigurable array when possible. The technique used in GAMA is based on the fact that packing operations into rows of the reconfigurable array is analogous to the problem of instruction selection in compilers (Figure 1). In instruction selection, the operations from the intermediate representation are packed into instructions from the target machine’s instruction set; there may be many alternatives, especially in the case of nonorthogonal complex instruction sets. Our task is similar, except rather than packing operations into a complex instruction, we are packing them into a module - a configuration of one or more rows of the array, implementing one or more operations.
Our algorithm, adapted from instruction selection in compilers, packs multiple operations into single rows of CLBs when possible, while preserving a regular bit-slice layout. Furthermore, placement and thus routing delays are considered simultaneously with packing, so that the total delay, not just the CLB delay, is optimized.
Problem
Reconfigurable coprocessors, most commonly implemented with field programmable gate array (FPGA) technology, have been shown effective in accelerating certain classes of applications. Computation-intensive kernels can be selected automatically or by hand for acceleration using the coprocessor. For each kernel selected, a configuration must be constructed to implement the computation graph of the kernel, which typically includes multi-bit integer operations and values. Unfortunately, deriving this configuration using vendor-supplied tools is often a time-consuming process, especially if user assistance is needed for floorplanning or routing. The situation is only made worse by dynamic reconfiguration, since in this case there may be a very large number of kernels for which configurations must be constructed. The resulting long compilation times will likely prevent reconfigurable coprocessors from being accepted as general-purpose computing platforms. Even in the case of embedded computing, long compilation times will hinder extensive exploration of the hardware/software design space.
Figure 1: Similarity between instruction selection and module packing for LUT-based arrays. Multiple operations are packed into one instruction (top), or into one module (bottom).
Furthermore, current vendor tools are not able to effectively optimize regular datapath designs [1]. In one common approach, each operation is implemented as a separate hard macro; no optimization is performed across macros, leading to underutilization of computational resources. In another approach, the design is flattened to gates, optimized, and mapped to CLBs, which are then placed independently; regularity is discarded in the process, so that a dense, regular bitslice layout rarely results.
Instruction selection and module packing are both examples of tree covering problems - an input tree must be covered with small trees from a pattern library (an instruction set in the case of instruction selection, or a module library in our case). There may be many possible covers; the goal is to find the best cover. For instruction selection, the best cover is the one requiring the fewest cycles to execute; static code size may also be used as a metric. For our module selection problem, we must instead consider area and/or critical path delay.
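As a concrete illustration of tree covering, the following Python sketch (with an invented three-pattern library and invented unit costs, not GAMA's or lburg's actual machinery) finds the minimum-cost cover of a small expression tree. Covering a multiply feeding an add with a single combined module beats covering the same tree with separate modules.

```python
# Toy tree covering by exhaustive pattern matching with memoization.
# Patterns and costs are hypothetical, for illustration only.

class Node:
    """An expression tree node: an operation label plus child subtrees."""
    def __init__(self, op, *kids):
        self.op = op
        self.kids = kids

# A tiny module "library": each entry is (name, pattern tree, cost in rows).
# A pattern leaf of None matches any subtree, which is then covered recursively.
PATTERNS = [
    ("input", Node("in"), 0),
    ("add",   Node("add", None, None), 1),
    ("mul",   Node("mul", None, None), 2),
    # A combined module packing a multiply feeding an add into fewer rows.
    ("mac",   Node("add", Node("mul", None, None), None), 2),
]

def matches(pat, node, holes):
    """Check whether pattern tree `pat` matches at `node`; collect the
    subtrees left uncovered in `holes`."""
    if pat is None:
        holes.append(node)
        return True
    if pat.op != node.op or len(pat.kids) != len(node.kids):
        return False
    return all(matches(p, n, holes) for p, n in zip(pat.kids, node.kids))

def best_cover(node, memo=None):
    """Minimum total cost to cover the tree rooted at `node`."""
    memo = {} if memo is None else memo
    if id(node) in memo:
        return memo[id(node)]
    best = float("inf")
    for _, pat, cost in PATTERNS:
        holes = []
        if matches(pat, node, holes):
            best = min(best, cost + sum(best_cover(h, memo) for h in holes))
    memo[id(node)] = best
    return best
```

For the tree add(mul(in, in), in), the combined "mac" module costs 2, while separate "add" and "mul" modules cost 3, so the combined module wins.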
This work is supported in part by DARPA grant DABT63-C-0048, ONR grant N00014-92-J-1617, and NSF grant CDA 94-01156. Authors can be contacted at [email protected] and [email protected].

0-8186-8159-4/97 $10.00 © 1997 IEEE

Although there are differences between the problems, they are similar enough that we are able to leverage algorithms and tools developed for instruction selection and retarget them to
the problem of module selection. Specifically, we modified lburg, a code generator generator used in constructing the lcc compiler [3]. The dynamic programming algorithm effectively considers all possible tree covers and finds the cover that is optimal with respect to the available library, in time that is linear in the number of nodes in the input tree. DAGON [4] uses a similar approach to the analogous problem of technology mapping for Boolean circuits. Like DAGON, GAMA must split directed acyclic graphs (DAGs) into trees before applying the tree covering algorithm. Unlike DAGON, GAMA will consider duplicating shared subtrees when performing the splitting, since that sometimes leads to faster and smaller configurations.
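The duplicate-versus-split decision can be illustrated with a toy cost model; all numbers and the routing penalty below are invented for illustration, whereas GAMA's actual costs come from module sizes and delays.

```python
# Hypothetical cost model for handling a subtree shared by two consumers:
# either compute it once and route its result to both uses (split), or
# duplicate it into each consumer (duplicate).

def cost_split(shared_cost, use_costs, routing_penalty):
    # Compute the shared subtree once; each extra fanout adds routing cost.
    return shared_cost + sum(use_costs) + routing_penalty * (len(use_costs) - 1)

def cost_duplicate(shared_cost, use_costs):
    # Re-implement the shared subtree inside every consumer; the duplicated
    # copy may pack into the consumer's module, sometimes making it cheaper.
    return sum(shared_cost + u for u in use_costs)

# Example: a cheap shared subtree with an expensive route favors duplication.
split = cost_split(shared_cost=1, use_costs=[3, 3], routing_penalty=2)   # 9
dup   = cost_duplicate(shared_cost=1, use_costs=[3, 3])                  # 8
choice = "duplicate" if dup < split else "split"
```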
Figure 2: GAMA placement. Both copies of module C are considered, with each implying a different placement; the one resulting in the better critical path delay is selected.
GAMA can optimize solely for delay or for area, or for a weighted average of these. However, GAMA's most effective optimization strategy is to optimize for delay on critical paths while optimizing for area off of critical paths. A preliminary pass is used to estimate the criticality of each path.
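A minimal sketch of this mixed objective, assuming a hypothetical criticality threshold (the abstract does not specify one):

```python
# Hypothetical per-module cost function: modules on (estimated) critical
# paths are costed by delay, all others by area.  The threshold value is
# invented for illustration.

def module_cost(delay, area, criticality, threshold=0.9):
    """Cost used during tree covering for one candidate module."""
    if criticality >= threshold:
        return delay   # on a critical path: minimize delay
    return area        # off the critical path: minimize area

# A preliminary pass would estimate each path's criticality; here we just
# show the two regimes for the same candidate module.
on_crit  = module_cost(delay=5.0, area=2.0, criticality=0.95)
off_crit = module_cost(delay=5.0, area=2.0, criticality=0.40)
```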
GAMA's mapping time ranges from 1.7 milliseconds to 2.7 milliseconds per node running on a 40 MHz SPARCstation IPX. To put this in perspective, we measured the time required for GAMA to partition and place all potential kernels extracted from a C source file and compared it to the time required to compile the source file using gcc (the times for GAMA do not include process startup overhead or the time for kernel extraction or module generation). The results, shown in Table 1, show that the additional time to compile to a processor with a reconfigurable coprocessor is qualitatively similar to the time required to compile to standard processors.
When performing the tree covering, GAMA only needs to know the sizes and the delays of the modules; there is no need to know the implementation details. Each module, as determined and placed by GAMA, is actually generated in a later step according to the operand widths, the operation or group of operations to be performed, and any constant operand values.

Integrated Placement
Routing delays can contribute significantly to the overall delay in a circuit implemented using FPGAs. In order to determine routing delays between two modules, the distance between them must be known. GAMA integrates relative placement determination into lburg's tree covering algorithm and thus can determine and use inter-module routing delay information at the same time module selection is performed.

    Source      # lines   # kernels   exec time (sec): gcc -O2 + GAMA
    deflate.c     399         8          4.6 + 0.8
    inflate.c     723        46          7.7 + 3.6
    util.c        528         3          3.6 + 0.2
    regex.c      2179        40         50.1 + 4.2
    kwset.c       689        19          8.5 + 1.7

Table 1: Additional time required for GAMA to map all kernels in source files, compared to the time for software compilation. All times were measured on a 40 MHz SPARCstation IPX.
GAMA uses physical postfix placement within each tree being mapped (Figure 2). Considering one parent module, the child module trees are each placed contiguously in some order, then the parent module is placed below them in the array. The order of child placement affects the delays between the child modules and the parent module.
References
[l] A. Koch, “Module Compaction in FPGA-based Regular Datapaths,” in Proc. 33rd ACM/IEEE Design Automation Conference, ACM, 1996.
GAMA determines the best child placement order simultaneously with module covering. The module library contains multiple copies of each module, with each copy having different placement annotations on its inputs. The tree covering algorithm will attempt to match all different copies of each module, thus trying each child ordering, and select the best. Bus lengths, and thus delays, resulting from each possible child ordering can easily be determined since the sizes of the child subtrees are known. The placement annotations in the final cheapest covering completely define all placement within the tree.
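The child-ordering search can be sketched as follows. The delay model (a child's bus delay proportional to the number of rows it must span to reach the parent) and all subtree heights are invented for illustration:

```python
# Hypothetical child-ordering search for one parent module: try every
# left-to-right placement order of the child subtrees, placing the parent
# directly below them, and pick the order minimizing the worst bus span.
from itertools import permutations

def order_delays(child_heights):
    """Return (worst-case bus rows, best ordering), assuming a child's bus
    must span the rows of every subtree placed after (below) it to reach
    the parent."""
    best = None
    for order in permutations(range(len(child_heights))):
        worst = 0
        for pos in range(len(order)):
            # Rows between this child's bottom row and the parent row.
            span = sum(child_heights[order[i]] for i in range(pos + 1, len(order)))
            worst = max(worst, span)
        cand = (worst, order)
        best = cand if best is None else min(best, cand)
    return best
```

With subtree heights [1, 4], placing the tall subtree first (farther from the parent is avoided for the short one) gives a worst bus span of 1 row instead of 4.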
[2] J. Hauser and J. Wawrzynek, "Garp: A MIPS Processor with a Reconfigurable Coprocessor," in Proceedings of IEEE Symposium on FPGAs for Custom Computing Machines (K. L. Pocek and J. M. Arnold, eds.), (Napa, CA), Apr. 1997.
[3] C. Fraser and D. Hanson, A Retargetable C Compiler: Design and Implementation. Benjamin/Cummings, 1995.

[4] K. Keutzer, "DAGON: Technology Binding and Local Optimization by DAG Matching," in Proc. 24th ACM/IEEE Design Automation Conference, pp. 341-347, ACM, 1987.
Speed Measurements
Empirical measurements confirm that GAMA executes in time linear in the number of nodes in the input graph.