

Heterogeneous Floorplanner for FPGA

Love Singhal and Elaheh Bozorgzadeh
Center for Embedded Computer Systems
University of California, Irvine, California 92697-3425
Email: lsinghal,[email protected]

Abstract: Current generations of FPGAs comprise many specialized hardware cores, such as embedded processors, multipliers, RAMs, and FIFOs, alongside regular arrays of reconfigurable logic. These specialized cores provide important functionality at high speed and low power. On any FPGA device, the embedded cores sit at fixed locations. In a design, the functionality mapped to these cores can be tightly integrated with the rest of the application, which makes floorplanning for applications with heterogeneous components very difficult. Some researchers have recently started looking into heterogeneous floorplanning on FPGAs. However, all of these works treat a module and all of its resources as a single bounding box, which degrades solution quality and leads to larger device area or excessive runtime. In this paper, we propose HPlan, a heterogeneous floorplanner for FPGAs that is fast and efficient at floorplanning designs with a variety of resources. We ran our experiments on the MCNC floorplanning benchmarks, assigning each block a random number of heterogeneous resources. We observed that as the statistical variation in the heterogeneous resource allocation increases, the traditional floorplanner for heterogeneous resources, unlike HPlan, fails to meet the area constraints on all of the benchmarks. We also present a case study of a real implementation on a Xilinx Virtex device. The proposed floorplanner thus provides an efficient way to handle floorplans with large variations in heterogeneous resources.

I. INTRODUCTION

Modern FPGAs are used for advanced system-on-chip (SoC) designs. They integrate embedded soft and hard processors and multiple embedded processing units on a single chip, and designs take advantage of the large amount of reconfigurable logic to implement functionality in hardware. For example, the largest Xilinx device consists of four PowerPC processors along with fast buses, serializers, deserializers, transceivers, and multiple clock generators. SoC designs can therefore consist of multi-core processors with specialized hardware units that can be configured. For example, the size of the on-chip data and instruction caches or the floating-point unit can be configured at design time. The methodology for implementing SoC designs on FPGA is similar to that for ASIC, with one major difference: while the designer can place the SoC components anywhere on the chip in an ASIC, the major embedded components are pre-placed in an FPGA. This can cause severe design quality issues in FPGA SoC design, as the placement of these embedded components determines the location of the overall blocks and the feasibility and performance of the design. Designers need to know whether the design configurations they have chosen lead to a feasible, high-quality placement. These decisions are taken early in the design cycle, so designers need a tool that estimates the post-placement quality of a design from its configuration without going through the long and tedious process of placement and routing. In our work, we propose a floorplanner that can evaluate the design cost at an early stage.

Modern FPGAs consist of various embedded blocks along with regular arrays of configurable logic blocks (CLBs). For example, Xilinx Virtex devices contain multiple RAM, FIFO, and multiplier blocks aligned vertically at regular intervals on the device. These embedded blocks implement specific functionality and help improve the performance and power of applications. In the overall design flow, synthesis tools map certain components of the application to these embedded blocks. The floorplanning tools then have to place the mapped components at the specific locations on the device. A module in the design can contain many embedded blocks of different types. The logic (CLBs) of the module should be placed close to these embedded blocks to reduce wire delay. Floorplanning embedded blocks together with ordinary logic in an FPGA is called heterogeneous floorplanning.

Recently, several works have looked into heterogeneous floorplanning for FPGAs [1]–[4]. All of these works treat a module with all its resources as one rectangular bounding box for floorplanning. Thus, if a module has a large number of one type of resource, for example the RAM blocks of a cache, the bounding box containing these RAM blocks will be very large and will contain many unused CLBs and multiplier blocks. Feng et al. [4] use a network flow algorithm to minimize the area of the floorplan for heterogeneous resources in FPGA. Their method considers uneven block shapes to find the minimum area of the block containing all its resources. However, the block in their work is still a continuous block, and if it contains an excess of one type of resource, it can become very large. This may waste a large share of the resources available on the device, as most bounding boxes will contain unused resources of one type or another. A floorplanning tool should make good use of the available logic on the device for modules that have large variations in their resource types, in order to reduce the overall area.

In this work, we propose a floorplanning technique for heterogeneous FPGA resources. Our floorplanner places all the resources in a single simulated annealing stage. It does not assume a single bounding box for a module and instead creates a separate bounding box for each resource type of the module. Figure 1 shows the floorplanning of three blocks with multiple bounding boxes. The three layers are the block RAM, CLB, and DSP layers, and the placement of the three blocks in each layer is shown. In this way, our floorplanner places multiple layers of bounding boxes simultaneously, where each layer represents a separate resource type. It also makes good use of the FPGA resources by allowing modules with varying resource requirements to overlap each other, but only in different layers. There is no overlap of modules in the same layer.

Fig. 1. Multi Layer Heterogeneous Floorplanning. The figure shows floorplanning of the three blocks using three different bounding boxes.

To implement this, the floorplanner adapts to the total usage of each resource type. If a resource type is lightly used by the application on the device, the floorplanner places the bounding box of that resource close to the other bounding boxes of the module. If a resource type is heavily used on the device, the floorplanner finds the tightest packing of that resource and still tries to place its bounding box close to the other bounding boxes of the module. In this way, the floorplanner reduces the overall wirelength and area of the design.

In our experiments, we used the standard MCNC benchmarks and added random heterogeneous resources. We compared our results against a standard simulated annealing based floorplanner that uses only a single bounding box for each module. We call this traditional floorplanner the disjoint floorplanner, and in this paper we use "traditional" and "disjoint" floorplanner interchangeably. We find that this simple floorplanner sometimes takes twice as much area as our floorplanner, and sometimes even fails to meet the area constraints. For the designs on which both floorplanners succeed, our floorplanner achieves almost the same wirelength. We also present a case study of an implementation on a Xilinx Virtex device. The proposed floorplanner could meet the area and timing constraints after the final place and route of our application, where the traditional floorplanner could not generate a feasible floorplan.

II. RELATED WORK

There have been several works in the area of floorplanning and placement for FPGAs. These works deal with the placement of a netlist of LUTs and FFs in a regular array structure of configurable logic blocks [5]. The configurable logic blocks are either placed regularly in an array (Xilinx, [5]) or placed hierarchically (Altera) on the device. Recently, Xilinx and Altera have introduced FPGAs that contain many different kinds of resources along with the traditional LUTs and FFs. These resources include block RAMs, multipliers, FIFOs, and even processors, and they are present at specific locations on the FPGA devices. In order to make proper use of these resources, researchers [1]–[4] have started looking at placement and floorplanning of these heterogeneous resources. These works have used various techniques to find floorplans of blocks that include heterogeneous components. Wang et al. [1] first introduced the problem of floorplanning for heterogeneous resources and highlighted the challenges, objectives, and cost functions for this problem. In [2], [3], realization lists are computed of the various block shapes that satisfy the resource requirements. These lists are then used to choose an appropriate shape for each block in the floorplan. However, all these works consider a block as a single entity and try to cover all the resources of that entity in a single continuous shape. Hence, it is not surprising that many of these floorplanners fail significantly when floorplanning actual complex applications. Our proposed floorplanner handles each resource type as a separate layer and floorplans each layer separately. In this sense, our floorplanner is similar to [6]. In [6], the authors address 3-D packaging of system-on-package by considering the floorplans of different layers separately.
While their approach and ours may look the same, their problem is significantly different from floorplanning for heterogeneous resources. In heterogeneous floorplanning, even if we treat each resource type separately, a block still has to be close to all its resources. This is not so in 3-D packaging, where each layer holds a separate block. The two problems therefore have different objectives, and our approach differs significantly from [6].

The basic idea of our approach is most similar to a very recently introduced approach presented in [7]. That work proposes a multi-layer density function which handles the placement of each resource type in a different layer. Its basic idea is also that handling different resource types individually can help in solving the heterogeneous placement problem. The paper proposes a timing-driven, density-function-based placement algorithm that tries to remove the overlap of the various blocks in different layers. Our approach differs from this work in several ways. Our work takes the traditional sequence pair floorplanning approach and adapts it to multiple resource types. In doing so, our floorplanner tries to maintain the relationships of a block with all its resources in each layer. The floorplanner adjusts to the requirement of each resource type to align the resources of a block together. While there can be overlap in the floorplans of [7], which requires an extra legalization stage, our floorplanner produces overlap-free floorplans. We discuss our approach in detail in the next section.

III. HETEROGENEOUS FLOORPLANNER

In this section, we describe our proposed floorplanner for heterogeneous resources on FPGA. We call our proposed heterogeneous floorplanner HPlan. The recent works in heterogeneous floorplanning consider a module as a single bounding box containing multiple resources. One straightforward way to perform heterogeneous floorplanning that uses all the resources well is to floorplan each resource type separately. This ensures a separate tight packing for each resource type that satisfies the fixed-outline constraints. However, such an approach, which involves multiple floorplanning runs, does not guarantee that the various resources of a module are placed close to each other. A possible solution is to first floorplan the logic (CLB) components of all the modules, and then floorplan the other resource types while making sure that the resources lie close to the logic components of their respective modules. However, such an approach could affect performance and could lead to highly overlapping floorplans. Moreover, it requires a separate stage for each resource type.

A. Problem Formulation

The multi-layer heterogeneous floorplanning problem for FPGA consists of a set $B = \{b_i \mid 1 \le i \le n\}$ of $n$ blocks and a set $L = \{l_j \mid 1 \le j \le k\}$ of $k$ layers. A block $b_i$ has a resource allocation vector $\phi_i = (n_1, n_2, \ldots, n_k)$, which means that block $b_i$ requires $n_1$ CLBs, $n_2$ RAMs, and so on. In each layer, a block is represented by a rectangular bounding box.
A bounding box $bb_{ij}$ for block $b_i$ in layer $j$ is represented by $bb_{ij} = (x, y, w, h)$, where $(x, y)$ is the lower-left point and $w$ and $h$ are the width and height of the bounding box. The area represented by $bb_{ij}$ should completely contain the $\phi_i(j)$ resources of block $b_i$ and is the smallest such rectangle of the given width or height. A floorplan $F$ is represented by a set $F = \{f_1, f_2, \ldots, f_k\}$, where floorplan $f_j$ is a non-overlapping two-dimensional placement of all the blocks in $B$ in the $j$-th layer. An architecture description file (ADF) contains the device size and the locations of each type of resource (layer) on the device. A resource may lie only at specific coordinates of the device. Thus, the floorplan $f_j$ contains bounding boxes of the modules only at locations where the $j$-th resource type is present on the device. A floorplan $F$ is feasible if (i) $F$ is free of overlap among the blocks in each layer, (ii) $F$ satisfies the fixed-outline constraints of the device in each layer, and (iii) $F$ satisfies the resource location constraints of the device in each layer.
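As an illustration only, the three feasibility conditions can be sketched in Python as follows. The names `Box`, `overlaps`, `is_feasible`, and the `has_resources` callback are hypothetical and not part of HPlan; condition (iii) is delegated to the callback because it depends on the device description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box bb_ij = (x, y, w, h); (x, y) is the lower-left corner."""
    x: int
    y: int
    w: int
    h: int


def overlaps(a: Box, b: Box) -> bool:
    """True if two boxes share interior area; touching edges do not count."""
    if a.w == 0 or a.h == 0 or b.w == 0 or b.h == 0:
        return False  # a block that uses no resource in this layer
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def is_feasible(floorplan, device_w, device_h,
                has_resources=lambda layer, block, box: True):
    """floorplan[j][i] is the box of block i in layer j.

    has_resources is a hypothetical callback standing in for condition (iii).
    """
    for layer, boxes in enumerate(floorplan):
        for i, box in enumerate(boxes):
            # (ii) fixed outline: the box must lie inside the device
            if (box.x < 0 or box.y < 0
                    or box.x + box.w > device_w or box.y + box.h > device_h):
                return False
            # (iii) resource locations: enough units under the box
            if not has_resources(layer, i, box):
                return False
            # (i) no overlap with other boxes of the SAME layer only
            if any(overlaps(box, other) for other in boxes[i + 1:]):
                return False
    return True
```

Note that boxes in different layers are never compared with each other, so the boxes of one block, or of two different blocks, may freely overlap across layers.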

The first condition requires that a bounding box $bb_{ij}$ does not overlap any other bounding box $bb_{mj}$ belonging to the same layer $j$. Note that the bounding box $bb_{ij}$ may still overlap, in a different layer, any other bounding box of the same module $i$ or of a different module. Thus, if the resources need to be packed tightly, floorplan $F$ can make the best use of the available resources. The second condition requires that the final floorplan does not exceed the size of the device. The third condition requires that the modules are placed at locations where the resources are available.

Objective. The objective of the floorplanner is to find a feasible floorplan $F$ that minimizes the following cost function:

$$C(F) = c_1 \cdot \mathrm{area}(F) + c_2 \cdot \mathrm{wirelength}(F) + c_3 \cdot \mathrm{aspect\_ratio\_penalty}(F) + c_4 \cdot \mathrm{bounding\_box\_deviation}(F).$$

The first term, area(F), is the final area of the floorplan $F$, which is equal to the maximum width times the maximum height over all the layers. Minimizing this term results in a smaller floorplan. The second term, wirelength(F), is the half-perimeter bounding box estimate of the wirelength. For simplicity, we assume that pins exist in the CLB layer only, so wirelength is measured in that layer. However, the HPlan algorithm does not depend on this assumption. The third term, aspect_ratio_penalty(F), is the absolute difference between the aspect ratio of the floorplan $F$ and the aspect ratio of the device. Minimizing this term helps in meeting the fixed-outline constraint. The fourth term, bounding_box_deviation(F), is introduced for multi-layer heterogeneous floorplanning. It is used to bring the multiple bounding boxes of a block close to each other. For each block $b_i$, the mean $(x_{im}, y_{im})$ of the centers of all bounding boxes $\{bb_{ij} : 1 \le j \le k\}$ is computed. The half-perimeter (Manhattan) distance of each bounding box center from the mean is added to the bounding box deviation.
The final bounding box deviation is equal to:

$$\sum_{i=1}^{n} \sum_{j=1}^{k} \left( |\bar{x}_{ij} - x_{im}| + |\bar{y}_{ij} - y_{im}| \right),$$

where $(\bar{x}_{ij}, \bar{y}_{ij})$ is the center of bounding box $bb_{ij}$. Minimizing this function brings the resources of a block in the various layers close to each other and reduces the internal wirelength and clock period. In the next section, we describe our approach to solving this problem. We propose a heterogeneous floorplanner, HPlan, which inherently minimizes this function by creating placements with reduced distances between bounding boxes.
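A minimal Python sketch of the bounding box deviation and the weighted cost is given below. The function names are illustrative, and the default weights are placeholders, since the values of $c_1, \ldots, c_4$ are not stated here.

```python
def bounding_box_deviation(blocks):
    """Sum of Manhattan distances of box centers from their per-block mean.

    blocks[i] is the list of (x, y, w, h) boxes of block i, one per layer.
    """
    total = 0.0
    for boxes in blocks:
        centers = [(x + w / 2, y + h / 2) for x, y, w, h in boxes]
        x_mean = sum(cx for cx, _ in centers) / len(centers)
        y_mean = sum(cy for _, cy in centers) / len(centers)
        total += sum(abs(cx - x_mean) + abs(cy - y_mean)
                     for cx, cy in centers)
    return total


def cost(area, wirelength, aspect_ratio_penalty, deviation,
         c=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum C(F); the weights c are placeholders, not the paper's."""
    c1, c2, c3, c4 = c
    return (c1 * area + c2 * wirelength
            + c3 * aspect_ratio_penalty + c4 * deviation)
```

For one block whose two boxes have centers (1, 1) and (5, 1), the mean is (3, 1) and the deviation is 2 + 2 = 4. If both boxes share the same center, the deviation is 0.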

IV. HPLAN FLOORPLANNER

In this section, we describe our proposed HPlan floorplanner. Our floorplanner places every resource type separately and does not merge the resources of a block into a single box to reduce complexity. It is thus more efficient in utilizing all the resources of the device and gives a feasible placement after floorplanning. It still tries to place the resources of a block close to each other using several new techniques. One of the techniques we propose adapts to the resource utilization.

A. Fixed-Outline Simulated Annealing

An FPGA is a bounded rectangle of fixed size. Hence, fixed-outline floorplanning is more suitable for it than pure area-minimization floorplanning as in [1]. Simulated annealing is widely used for ASIC floorplanning because it tends to give better results than deterministic approaches. HPlan is based on Parquet [8], [9], a fixed-outline simulated annealing floorplanner for ASIC.

B. Sequence Pair for Multiple Layers

To represent the floorplans of the individual layers, multiple sequence pairs could be used. However, if the sequence pairs are independent of each other, the resources of a module can end up very far apart. We therefore use a single sequence pair for all the layers. In this respect, our work is similar to [4]. However, that work uses a single bounding box for each module and does not guarantee that all the required resources exist at the locations where the modules are placed after simulated annealing.

1) Traditional Sequence Pair: We use the sequence pair representation to make random placement moves. In the sequence pair representation, two permutations (orderings) of the blocks are maintained. The two permutations capture the geometric relation between each pair of blocks. Every two blocks constrain each other in either the vertical or the horizontal direction. The following relationships hold for sequence pairs:

(< ..., p, ..., q, ... >, < ..., p, ..., q, ... >) ⇒ p is left of q
(< ..., p, ..., q, ... >, < ..., q, ..., p, ... >) ⇒ p is above q

The sequence pair representation is shift-invariant since it only encodes pairwise relative placements of modules. Actual placements are produced by aligning the blocks against the horizontal and vertical axes, starting from x = 0 and y = 0. All neighboring blocks are placed adjacent to each other. This representation automatically compacts the floorplan, as no whitespace is added and blocks are placed as close to each other as possible.
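The two relations and the compaction step can be illustrated with a simple quadratic-time evaluation of a sequence pair. This is a sketch of the standard technique rather than the Parquet implementation; the names `pack`, `seq_pos`, and `seq_neg` are hypothetical, and a lower-left origin is assumed, so that "p is above q" means p sits at a larger y than q.

```python
def pack(seq_pos, seq_neg, sizes):
    """Compact placement {block: (x, y)} encoded by a sequence pair.

    seq_pos, seq_neg: the two permutations of the block names.
    sizes: block -> (width, height).
    """
    pos_index = {b: i for i, b in enumerate(seq_pos)}
    neg_index = {b: i for i, b in enumerate(seq_neg)}

    # a is left of b when a precedes b in BOTH sequences. Visiting blocks
    # in seq_pos order guarantees every left neighbour is already placed.
    x = {}
    for b in seq_pos:
        x[b] = max((x[a] + sizes[a][0] for a in x
                    if neg_index[a] < neg_index[b]), default=0)

    # a is above b when a precedes b in seq_pos but follows b in seq_neg.
    # Visiting blocks in seq_neg order guarantees every block below a
    # is already placed.
    y = {}
    for a in seq_neg:
        y[a] = max((y[b] + sizes[b][1] for b in y
                    if pos_index[a] < pos_index[b]), default=0)

    return {b: (x[b], y[b]) for b in seq_pos}
```

With blocks a (2×1), b (3×2), and c (1×1), the pair (<a, b, c>, <a, b, c>) places all three in a row at x = 0, 2, and 5. The pair (<a, b>, <b, a>) places a above b, at y = 2.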
The following lemma holds for computing a placement from a sequence pair.

Lemma 1: In computing placements of blocks using a sequence pair, we always move in the topological order of the left-right or top-bottom relationships in the sequence pair.

We use a single sequence pair to represent the placements of all the resources.

2) MultiBox Sequence Pair: We call the sequence pair for multiple bounding boxes the multiBox sequence pair. The following rules are defined for the multiBox sequence pair:

(< ..., p, ..., q, ... >, < ..., p, ..., q, ... >) ⇒ $bb_{pj}$ is left of $bb_{qj}$, ∀j, 1 ≤ j ≤ k
(< ..., p, ..., q, ... >, < ..., q, ..., p, ... >) ⇒ $bb_{pj}$ is above $bb_{qj}$, ∀j, 1 ≤ j ≤ k

Hence, each bounding box in a layer is related to every other bounding box of the same layer by one of the above two rules.
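As a sketch of the multiBox rule, the same sequence pair can be evaluated once per layer, each time with that layer's box sizes. The code below uses plain tight packing in every layer and ignores resource locations, so it only illustrates how the relations are shared across layers; `pack_layer` and `multibox_pack` are hypothetical names, and a lower-left origin is assumed.

```python
def pack_layer(seq_pos, seq_neg, sizes):
    """Tight sequence-pair packing of one layer; sizes: block -> (w, h)."""
    pos_index = {b: i for i, b in enumerate(seq_pos)}
    neg_index = {b: i for i, b in enumerate(seq_neg)}
    x, y = {}, {}
    for b in seq_pos:  # left-of: before b in both sequences
        x[b] = max((x[a] + sizes[a][0] for a in x
                    if neg_index[a] < neg_index[b]), default=0)
    for a in seq_neg:  # above: before in seq_pos, after in seq_neg
        y[a] = max((y[b] + sizes[b][1] for b in y
                    if pos_index[a] < pos_index[b]), default=0)
    return {b: (x[b], y[b]) for b in seq_pos}


def multibox_pack(seq_pos, seq_neg, layer_sizes):
    """Apply ONE sequence pair to every layer.

    layer_sizes[j][block] is the (w, h) of the block's box in layer j;
    (0, 0) stands for a block that uses no resource of that layer.
    """
    return [pack_layer(seq_pos, seq_neg, sizes) for sizes in layer_sizes]
```

With the pair (<a, b>, <a, b>), CLB boxes of 4×4 for both blocks, and a BRAM box of 1×4 for b only, block b lands at x = 4 in the CLB layer but at x = 0 in the BRAM layer. The left-of relation is respected in both layers, yet the two boxes of b are not aligned, which is the situation the resource alignment step of Section IV-D addresses.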

Theorem 1: In the multiBox sequence pair, a bounding box is not related to the other bounding boxes of the same module.

The above theorem gives the flexibility to place a bounding box of a module at any location required by its location constraints. The following theorem, however, states an important property that characterizes the relationship between the bounding boxes of a block. Theorem 3 then provides an important use of this relationship.

Theorem 2: The use of a single sequence pair in the multiBox sequence pair ensures that the bounding box of a module in one layer has the same relationships with the other bounding boxes of that layer as the module's bounding boxes in other layers have with the other bounding boxes of their layers.

Theorem 3: The multiBox sequence pair permits a placement in which the centers of all bounding boxes of a module are located at the same coordinates.

Proof: In the multiBox sequence pair, bounding boxes in each layer can be spaced apart from each other by horizontal and vertical shifts. If the horizontal and vertical shifts in each layer are large enough, the centers of the bounding boxes in each layer can be made to match the centers of their respective modules. Since, according to Theorem 2, all bounding boxes of a module follow the same relationships with the bounding boxes of other modules, there is no conflict in placing all the bounding boxes of a module on top of each other (aligned at the same coordinates). It is not possible that in resource layer $j$, $bb_{aj}$ is left of $bb_{bj}$ while in resource layer $l$, $bb_{al}$ is right of $bb_{bl}$. Hence, there is no conflict in the horizontal and vertical shifts.

Theorem 3 says that if the device is big enough, all the resources of every module can be placed close to each other. This helps in minimizing the internal wirelength of a module. Theorem 1 says that if there are not enough resources available at specific locations, the different bounding boxes of a module can be placed at separate coordinates.
Thus, the multiBox sequence pair can be used to generate floorplans that range from extremely compact placements under strict resource constraints to placements with properly aligned bounding boxes. If the resource constraints are very strict in one or more layers, the bounding boxes in those layers can be tightly packed, irrespective of the locations of the bounding boxes in other layers. Since the other layers are not so tightly packed, the bounding boxes in those layers can be shifted to locations close to the bounding boxes in the tightly packed layers. In this way, the multiBox sequence pair allows placements that adapt to the resource constraints. In the next subsection, we propose a model of the resource architecture. An architecture description file (ADF) models the resource architecture by specifying the locations of the various resources in a simplified manner.

C. Resource Locations in ADF

The architecture description file contains the exact locations of all the resources on a device. Since the number of such resources can be very large, we use a simple model of heterogeneous resource locations. Every resource layer in the ADF is described by the following parameters.

1) X-Sequence: The X-sequence is a sequence of X-coordinates $(x_1, x_2, x_3, \ldots)$ that gives the starting X-coordinates of the resource units. For example, the X-sequence for BRAM blocks in the Xilinx Virtex 4 sx35 device is (4, 8, 12, 16, 24, 28, 32, 36). The X-sequence for CLBs is a continuous sequence.

2) Y-Sequence: Similarly, the Y-sequence is a sequence of Y-coordinates that gives the starting Y-coordinates of the resource units. For example, the Y-sequence for BRAM blocks in the Xilinx Virtex 4 sx35 device is (0, 4, 8, ..., 92). The Y-sequence for CLBs is a continuous sequence. The exact coordinates of a resource unit can be found by choosing any one element from the X-sequence and any one element from the Y-sequence.

3) Width and Height: The width and height of the resource units are expressed in units of CLBs. For example, the width and height of a DSP block are (1, 1), and those of a hard PowerPC processor in a Xilinx Virtex 4 fx device are (8, 24).

The above parameters completely describe the coordinates and sizes of the resource units in the ADF. This simple model makes it easy to find placements of the modules. Moreover, it allows us to process the horizontal and vertical coordinates of a resource in the sequence pair separately. The horizontal coordinates of blocks in the multiBox sequence pair can be found from the X-sequence, and the vertical coordinates from the Y-sequence. We introduced the multiBox sequence pair to allow proper alignment of bounding boxes. To ensure that the bounding boxes are properly aligned while satisfying the resource constraints, we now propose an adaptive method to compute the placement (x and y coordinates) of each bounding box from the sequence pair.

D. Resource Alignment

We now come to an important part of our proposed floorplanner. This part describes how the actual coordinates of the bounding boxes are computed from a given sequence pair.
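For illustration, the ADF layer model of Section IV-C (X-sequence, Y-sequence, width, and height) can be sketched as a small Python class. `ResourceLayer` and its methods are hypothetical names, and the BRAM unit size of (1, 4) used in the example is an assumption made only to have concrete numbers; it is not stated in the text.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ResourceLayer:
    """One resource layer of an architecture description file (ADF)."""
    x_seq: tuple   # starting X-coordinates of the resource units
    y_seq: tuple   # starting Y-coordinates of the resource units
    width: int     # unit width, in CLB units
    height: int    # unit height, in CLB units

    def units(self):
        """All unit origins: any x from x_seq paired with any y from y_seq."""
        return list(product(self.x_seq, self.y_seq))

    def units_inside(self, x, y, w, h):
        """Number of units lying completely inside the box (x, y, w, h)."""
        return sum(1 for ux, uy in product(self.x_seq, self.y_seq)
                   if x <= ux and ux + self.width <= x + w
                   and y <= uy and uy + self.height <= y + h)


# BRAM layer using the X- and Y-sequences quoted for the Virtex 4 sx35.
# width=1 and height=4 are ASSUMED values, not taken from the text.
bram = ResourceLayer(
    x_seq=(4, 8, 12, 16, 24, 28, 32, 36),
    y_seq=tuple(range(0, 96, 4)),  # 0, 4, 8, ..., 92
    width=1,
    height=4,
)
```

Because the X and Y sequences are independent, a check such as `bram.units_inside(0, 0, 10, 8)` could also be computed per axis, which is what allows the horizontal and vertical coordinates to be processed separately.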
Using the rules defined in Section IV-B.2, the bounding boxes of a module can end up very far from each other if they are placed independently in each layer. For example, it is possible that in a certain floorplan all the BRAM blocks of a module A are on the left of the device while the slices (CLBs) of module A are on the right, especially when the modules to the left of module A do not have any BRAM blocks. This can severely increase the internal wirelength of module A. Through the bounding box deviation term of the cost function, described in Section III-A, the simulated annealing engine penalizes placements in which bounding boxes are far apart. However, if none of the floorplans generated by the multiBox sequence pair has the boxes close to each other, the cost function alone cannot produce a good floorplan. Hence, a sequence pair for multiple bounding boxes should be able to create floorplans with bounding boxes close to each other.

Traditionally, the longest path algorithm is used to compute the placement of blocks from a sequence pair. This algorithm packs all the blocks tightly. However, for the simultaneous placement of multiple layers, not all the bounding boxes need to be tightly packed. The final area of the floorplan is determined by the combined placement of all the layers, so a layer with a smaller area than the rest need not be tightly packed. In such a scenario, the bounding boxes in the layer with the smaller area should be placed close to the bounding boxes of their respective modules in other layers. The floorplanner should be able to push a bounding box in a layer with smaller area close to the other bounding boxes of the block.

We propose a new algorithm to compute the placement of the bounding boxes in the multiBox sequence pair. Like its traditional sequence pair counterpart, our algorithm moves in the topological order of the left-right or top-bottom relationships. It iteratively processes one block at a time and assigns horizontal or vertical coordinates to the bounding boxes of that block. However, the algorithm does not give the tightest placement to each bounding box. The intuition behind our algorithm is as follows. Every bounding box of a block is placed in either the horizontal or the vertical direction depending on the placement of its own layer. The bounding box that gets the farthest placement (in either the horizontal or the vertical direction) is the most tightly constrained box and cannot be shifted closer to the other bounding boxes of the block, i.e. in the backward direction. Hence, the other bounding boxes should move closer to that bounding box in the forward direction. Therefore, when computing the placement of the bounding boxes of a block, the bounding boxes are pushed towards the farthest bounding box. Ideally, to minimize the internal wires, the bounding boxes should be pushed until they all lie on top of each other, so that the whole block looks like a single bounding box. This is what would happen in the traditional floorplanners.
However, this lateral push of a bounding box affects the area of the corresponding layer. A lateral push means that the bounding box is not placed at the location where it could be tightly placed. Hence, the push leaves some resources of that layer unutilized. A bounding box of some other block cannot occupy those resources either, because once the original block has been placed in topological order, other bounding boxes can only be placed in the forward directions (right or bottom direction). Otherwise, there would be a conflict in the sequence pair representation. Therefore, the amount of push applied to a bounding box should be based on the number of resources of that type available on the device and the number of occupied resources. Figure 2 gives the pseudo code of the proposed placer algorithm, which is used to compute the placement of the blocks. For each layer $l$, we first compute the resource allocation ratio $\alpha_l$, defined as the area of the layer over the area of the device. If, for a heterogeneous resource type, $m$ resources are used and $n$ resources are available, the value of $\alpha_l$ is $m/n$. The value of $\alpha_l$ describes the tightness of the area constraint of a particular layer. To compute the placement of a block $i$ in layer $l$, we compute the longest path placements of each bounding box of block $i$. We then compute the center-difference of the $l$-th bounding

Fig. 2. Pseudo Code of the Placer Algorithm. The Placer algorithm is used to compute the placements of bounding boxes using multiBox sequence pair.
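The pseudo code of Figure 2 is not reproduced in this text. As a rough sketch of the push step that this section describes, and not the authors' exact algorithm, the adjustment for a single block along one axis might look as follows in Python. The function name `push_block` and its arguments are hypothetical, the "farthest" box is assumed to be the one with the largest center coordinate, and snapping the result to valid resource locations is omitted.

```python
def push_block(coords, extents, alpha, gamma=0.5):
    """Push the boxes of ONE block toward its farthest box along one axis.

    coords[l]:  longest-path coordinate of the block's box in layer l
    extents[l]: size of that box along the same axis
    alpha[l]:   resource allocation ratio of layer l (used / available)
    gamma:      tuning constant; 0.5 is the value quoted in the text
    """
    centers = [c + e / 2 for c, e in zip(coords, extents)]
    farthest = max(centers)  # ASSUMPTION: farthest is judged by box center
    pushed = []
    for c, center, a in zip(coords, centers, alpha):
        center_difference = farthest - center
        # Heavily used layers (alpha near 1) are barely pushed.
        push = center_difference * (1 - a ** 2) * gamma
        pushed.append(c + push)
    return pushed
```

For a block whose CLB box starts at 10 and whose BRAM box starts at 2, both 4 wide, with ratios 0.9 and 0.5, the CLB box is the farthest and stays at 10, while the BRAM box moves by 8 · (1 − 0.25) · 0.5 = 3 to coordinate 5.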

box of block $i$, which is the lateral difference between the centers of the farthest bounding box of block $i$ and the current bounding box. The push value of block $i$ in layer $l$ is computed as

$$\mathrm{push}_{il} = \text{center-difference}_{il} \cdot (1 - \alpha_l^2) \cdot \gamma.$$

It consists of three components. The term $\text{center-difference}_{il}$ is the distance of the $l$-th bounding box of block $i$ from the farthest bounding box of block $i$. The constant $\gamma$ is used for tuning the displacement; in our experiments, $\gamma$ is set to 0.5. The value $(1 - \alpha_l^2)$ is close to 0 if $\alpha_l$ is close to 1, signifying that the bounding box should not be pushed when the layer is heavily utilized. Using the square of $\alpha_l$ was found to give better results than using $\alpha_l$ itself. The value of $\mathrm{push}_{il}$ is then added to the coordinate of the $l$-th bounding box to determine its new coordinate. The following lemma describes an important property of the placer algorithm.

Lemma 2: The coordinates of all bounding boxes of a block are computed in a single iteration of the placer algorithm.

Lemma 2 implies that in a single pass in topological order, our placer algorithm computes the coordinates of all bounding boxes of all blocks. Hence, the placer algorithm is computationally efficient. The following theorem is stated without proof.

Theorem 4: The complexity of the placer algorithm is $O(n \cdot l \cdot \log n)$, where $n$ is the number of blocks and $l$ is the number of layers.

The reader may note that the push function described above is not used in the cost function of simulated annealing but in the actual coordinates of the bounding boxes. It is thus a deterministic way to place the blocks and is not part of an objective. The next section is a short discussion on meeting the RLOC constraints of a block using the proposed algorithm.

E. Relative Location (RLOC) Constraints

The method described in the previous sections places the various resources of a block in different locations.
This gives more savings in area and more feasibility in the floorplan. However, there may be a design core that requires all of its resources to be placed in fixed relative positions only. These constraints are called the RLOC constraints of the IP core. Our floorplanner can also handle the special RLOC constraints of a block. Lemma 2 states that the coordinates of all the bounding boxes of a block are computed simultaneously in the placer algorithm. Hence, if a block has such a requirement, each bounding box of that block can be pushed appropriately so that its alignment with the overall center of the block satisfies the relative location constraints. The RLOC constraints of each block can be given as input to the floorplanner. The HPlan floorplanner will then satisfy these constraints in each simulated annealing iteration by generating the placements using the placer algorithm with RLOC constraints. We now proceed to the experimental section of this work.

V. EXPERIMENTS

In this section, we discuss the advantages of our approach compared to a traditional floorplanner. The next subsection describes the experimental setup.

A. Experimental Setup

We implemented our HPlan floorplanner in Java. A traditional floorplanner based on Parquet, extended to handle heterogeneous resources, was also developed in Java. This floorplanner considers only a single bounding box for each block. We call it the disjoint floorplanner, as it keeps all the blocks separate from each other. The disjoint floorplanner uses the same cost functions and the same moves as the HPlan floorplanner. The two floorplanners are thus identical in every respect except that the former does not allow overlap of the bounding boxes whereas the latter does. The comparison therefore isolates the effect of our proposed technique over the traditional technique, rather than incidental differences between floorplanner implementations. We used the Xilinx Virtex 4 SX architecture as a model of our heterogeneous architecture.
The main heterogeneous components of the architecture are BRAMs (block RAMs), DSP blocks (multipliers), and FIFOs. An architecture description file (ADF) is created for this architecture as described in Section IV-C. Our experiments are run on a Windows XP machine with a 3.2 GHz processor and 2 GB of RAM.

No benchmark set is available in the literature for evaluating the placement of heterogeneous resources. We therefore use the MCNC benchmarks, which are commonly used in physical design research to evaluate floorplanning tools. The blocks in these benchmarks have only a single resource type. The blocks are assumed to be soft blocks, which means that their shapes can change during floorplanning. We modified the benchmarks to add heterogeneous components to each block. The original area of each block is taken to be the number of CLBs inside the block. The heterogeneous components are then assigned to each block randomly, as discussed below.
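The random assignment just mentioned can be sketched as follows. This is a minimal illustration, not the authors' Java application. It assumes that the per-block mean µ is the device total divided by the number of blocks, that each Gaussian draw is rounded to an integer, and that negative draws are clamped to zero; the text specifies none of these. The function name `assign_resources`, the seed, and the figures of 33 blocks and 192 BRAMs are hypothetical. Only the relation σ = ρµ and the cap at the device total come from the procedure described in this subsection.

```python
import random


def assign_resources(num_blocks, device_total, rho, seed=0):
    """Draw a resource count per block for ONE resource type.

    num_blocks   -- number of blocks in the benchmark
    device_total -- resources of this type on the hypothetical device
    rho          -- standard deviation ratio, so that sigma = rho * mu
    """
    rng = random.Random(seed)          # fixed seed keeps the sketch repeatable
    mu = device_total / num_blocks     # ASSUMPTION: mean = device total / blocks
    sigma = rho * mu                   # from the text: sigma = rho * mu
    counts = []
    for _ in range(num_blocks):
        draw = round(rng.gauss(mu, sigma))
        draw = max(0, draw)            # ASSUMPTION: no negative resource counts
        draw = min(draw, device_total) # from the text: cap at the device maximum
        counts.append(draw)
    return counts


# Hypothetical example: 33 blocks sharing a device with 192 BRAMs.
uniform = assign_resources(33, 192, rho=0.0)
varied = assign_resources(33, 192, rho=2.0)
print(min(uniform), max(uniform))
print(min(varied), max(varied))
```

With ρ = 0 every block receives the same rounded mean, while larger ρ spreads the counts between zero and the cap, which is the variation the experiments sweep over.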

Heterogeneous Resource Assignment: The assignment of heterogeneous components to the MCNC benchmarks is done using a Java application. It assigns resources to each application in the following manner. First, the total area of the application is computed by adding the areas of all blocks. Then, using the aspect ratio of the Xilinx Virtex 4 SX device and the total area of the application, the height and width of the application are found. The MCNC benchmarks are very large: the total size of each floorplan is around 5 to 10 times the size of the largest Virtex FPGA device. This makes the task of the floorplanner more difficult and tests the ability of our floorplanner. Using the height and width of the hypothetical large Virtex 4 SX device, the application then finds the number of heterogeneous resources in the device. Section IV-C shows how the heterogeneous resources of a device of any size can be estimated.

The heterogeneous resource allocation of each block is assumed to be a Gaussian random variable. For each resource type, the mean value µ (number of resources per block) is computed. The Java application is also given the standard deviation ratio ρ as input. The standard deviation σ of the resource allocation is calculated by multiplying ρ with the mean µ, that is, σ = ρµ. The resources of each type are then assigned to each block using a Gaussian random number generator. If the number of resources drawn for a block exceeds the resources available on the hypothetical Virtex 4 SX device, the number of resources in that block is capped at the maximum available on the device.

B. Results

1) Resource Variation: The blocks in the MCNC benchmarks have relatively similar sizes. If the sizes of the heterogeneous resources of the blocks (the sizes of the bounding boxes) are also similar to each other, then there is more overlap between the bounding boxes of a block.
Therefore, the disjoint floorplanner, which considers only a single common bounding box for each block, produces blocks of more uniform size when the number of heterogeneous resources is uniform across all the blocks. Conversely, if there is more variation in the distribution of resources in the blocks, then there will be higher variation in the size of the common bounding box. Hence, it is possible that for some blocks the common bounding box is very big even though the block has very few resources of one type. The common bounding box of the disjoint floorplanner will then contain a lot of empty space, and as a result the floorplanner will not meet the area constraint or will have higher area. The HPlan floorplanner has no such limitation, as it considers an individual bounding box of a block for each resource type.

In real-life applications, there is a high variation in the heterogeneous resource usage. An IP core mainly consists of one or two types of heterogeneous resources. For example, an on-chip cache will have a large number of BRAMs and a negligible number of DSP blocks. Similarly, an FFT multiplier will have a large number of DSP and FIFO blocks but few BRAMs. Thus, a floorplanner should handle this kind of variation in resources effectively, while minimizing area and wirelength.

Table I gives the area of the floorplans of various MCNC benchmarks using the two floorplanners at various resource allocation variations. The value ρ determines the standard deviation of the resource allocation and is explained in Section V-A. As the value of ρ increases, the number of resources in the blocks becomes more non-uniform. There is no change in the number of resources in different rows of the table. Only the uniformity of resource allocation decreases along the rows of the table with increasing value of ρ. As seen from Table I, the disjoint floorplanner takes more area than the HPlan floorplanner in all the benchmarks at all values of ρ. Hence, the HPlan floorplanner is more efficient in reducing the area of the design by efficiently allocating the resources.

Table I also shows that as the variation of the resource distribution increases (ρ increases), the area of the HPlan floorplanner remains almost constant while the area of the disjoint floorplanner increases significantly. For example, for benchmark hp, HPlan takes about 1.65E7 units of area for all the values of ρ. However, the disjoint floorplanner takes 2.16E7 units of area for ρ = 0.5 and 3.02E7 units of area for ρ = 2. Similar results can be seen for the benchmarks ami49, apte and xerox. In benchmark ami33 for ρ = 1, the disjoint floorplanner failed to find a feasible floorplan. This is because as the variation in the resource distribution increases, there is a significant increase in the size of the common bounding box of the disjoint floorplanner. That is, the disjoint floorplanner tends to waste a lot of area because it uses a single bounding box for each block. Hence, it is difficult for the disjoint floorplanner to meet the area constraints or reduce the total area of the application.

Some of the results in Table I do not follow the general trend of other benchmarks.
Examples are benchmark apte for ρ = 2 and benchmark ami33 for ρ = 0.75. This is because the resource distribution in our examples is inherently random. Hence, the random number generator may sometimes produce no significant variation in the resource distribution even if the standard deviation is high. The distribution may be too uniform even if ρ is high (as in the case of benchmark ami33), or too non-uniform (as in the case of benchmark apte at ρ = 1.5). However, in each of these cases, the HPlan and the disjoint floorplanner have similar resource distributions, and the disjoint floorplanner takes more area than HPlan.

2) Floorplan Runtime: Theorem 4 shows that the complexity of the HPlan floorplanner is a constant factor times the time complexity of traditional non-heterogeneous floorplanners. The constant factor is the number of heterogeneous resource types. Thus, in our case, the floorplanning runtime should be 4 times that of the disjoint floorplanner. We found that in practice, however, the runtime of HPlan is much less than the estimated runtime. The average runtime of the HPlan floorplanner is found to be 1.4 times that of the traditional non-heterogeneous floorplanner.

TABLE I
AREA OF THE MCNC BENCHMARKS USING TWO FLOORPLANNERS UNDER DIFFERENT RESOURCE VARIATION. SYMBOL X SHOWS THAT THE FLOORPLANNER COULD NOT MEET THE AREA CONSTRAINTS

Benchmark     ami33               ami49               apte                hp                  xerox
Floorplanner  HPlan     Disjoint  HPlan     Disjoint  HPlan     Disjoint  HPlan     Disjoint  HPlan     Disjoint
ρ = 0.5       1831655   2310860   5.76E7    7.75E7    7.23E7    8.30E7    1.56E7    2.16E7    3.05E7    4.086E7
ρ = 0.75      1.59E6    2.11E6    5.76E7    8.59E7    7.15E7    9.43E7    1.73E7    2.3E7     3.69E7    4.39E7
ρ = 1         1.94E6    X         6.27E7    9.39E7    7.01E7    10.1E7    1.75E7    2.4E7     3.27E7    4.54E7
ρ = 1.50      2.04E6    3.3E6     6.19E7    9.95E7    8.14E7    14.6E7    1.59E7    2.57E7    3.09E7    5.2E7
ρ = 2         1.98E6    3.3E6     6.19E7    11.5E7    6.35E7    6.93E7    1.52E7    3.02E7    3.06E7    5.78E7
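The areas in Table I can be converted into relative savings, which makes the trend with ρ easier to read. The sketch below does this for the hp benchmark, using the HPlan areas 1.56E7 and 1.52E7 and the disjoint areas 2.16E7 and 3.02E7 reported for ρ = 0.5 and ρ = 2. The helper names `area_saving` and `format_savings` are illustrative, and defining the saving as one minus the area ratio is an assumption about how such a comparison would be expressed.

```python
def area_saving(hplan_area: float, disjoint_area: float) -> float:
    """Fraction of the disjoint floorplan's area that HPlan saves."""
    return 1.0 - hplan_area / disjoint_area


# hp benchmark: rho -> (HPlan area, disjoint area), values from Table I
hp_areas = {0.5: (1.56e7, 2.16e7), 2.0: (1.52e7, 3.02e7)}


def format_savings(areas):
    """Return one formatted line per rho value, in increasing order of rho."""
    return [
        f"rho={rho}: {area_saving(hplan, disjoint):.1%}"
        for rho, (hplan, disjoint) in sorted(areas.items())
    ]


for line in format_savings(hp_areas):
    print(line)
# prints:
# rho=0.5: 27.8%
# rho=2.0: 49.7%
```

For hp, the saving grows from roughly 28% to roughly 50% as ρ goes from 0.5 to 2, because the HPlan area stays nearly flat while the disjoint area rises.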

C. Embedded Processor Case Study

In this section, we discuss the implementation of a system with heterogeneous components on a Xilinx Virtex device using a floorplan with multiple bounding boxes. We use Xilinx ISE 8.2 to design, synthesize, and implement our system. The device used is the Xilinx Virtex 4 SX-35 FPGA. We intend to observe how the system behaves when the bounding boxes of a component are specified separately. Xilinx ISE uses a ucf file to specify user constraints on area and performance. We use the ucf file to specify the floorplanning constraints generated by our floorplanner. The ucf file can specify area constraints describing the locations of the CLB, BRAM, DSP, and FIFO components of a block.

We implement a typical application with a single Microblaze soft processor and multiple cores as coprocessors. There are two different FFT cores, one DES core and one CRC core. The FFT cores are generated using the Xilinx Coregen library. The DES and CRC cores are generated from open-source VHDL code and synthesized using Xilinx ISE. While the DES and CRC cores are used for network applications, the FFT core is used for multimedia applications. Microblaze OPB and FSL wrappers are then created for each of the coprocessors. These coprocessors are then attached to the Microblaze processor using Xilinx Platform Studio 8.2i. Table II shows the configuration of the Microblaze processor and the various coprocessors used in the system. It also shows the estimated sizes of the heterogeneous components in each block. The timing constraint is 100 MHz.

Figure 3 shows the area constraints of the various components in the Xilinx PACE tool, as found by our floorplanner. We only show four components in this figure to show the usefulness of our approach. Figure 3 shows some extra BRAM blocks that are used by the blocks outside their CLB boundaries.
If these extra BRAM blocks are put inside one common bounding box (as done by the disjoint floorplanner), the four blocks will occupy almost the whole area of the device. There will not be any space left for other components on the device. Hence, the disjoint floorplanner fails to meet the area constraints in this system. However, using the HPlan floorplanner, the place and route tool is able to successfully implement the system and meet the timing constraints. The final implementation takes about 38 minutes to place and route. The system uses 98% of the slices, 63% of the BRAMs and FIFOs, and 71% of the DSP48 slices. While the disjoint floorplanner could not meet the area constraints, the HPlan floorplanner could meet the area and performance constraints. Hence, our floorplanner is clearly beneficial over the disjoint floorplanner in real-life applications where the variation in resources could be very high.

TABLE II
SYSTEM CONFIGURATION OF THE MICROBLAZE PROCESSOR AND ITS CO-PROCESSORS

Blocks            Configuration
Microblaze        Floating point unit, integer multiplier and divider
                  Local Memory Bus (LMB) cache
                  FSL and OPB interface with the coprocessors
                  Estimated size: 34 BRAM, 7 DSP and 7 FIFO
LMB               32 KB instruction and 32 KB data caches
                  Estimated size: 32 BRAM
SRAM Controller   For connecting with external SRAM
                  OPB connection with the processor
                  Uses only CLB
FFT 4096          Fast Fourier Transform IP core (Xilinx Coregen)
                  Transform length: 4096
                  OPB interface with Microblaze
                  Estimated size: 76 DSP slices and 44 BRAM
FFT 2048          Fast Fourier Transform IP core (Xilinx Coregen)
                  Transform length: 2048
                  OPB interface with Microblaze
                  Estimated size: 54 DSP slices and 11 BRAM
DES Core          FSL interface with Microblaze
                  Uses only CLB
CRC Core          FSL interface with Microblaze
                  Uses only CLB

Fig. 3. Area constraints of various components of the system on Xilinx Virtex device

In our experiments, we showed that our proposed floorplanner HPlan can handle heterogeneous resources more efficiently than disjoint floorplanners that use only a single bounding box. This leads to savings in area and improvement in feasibility. In the actual implementation on a Xilinx device, our floorplanner could meet the area and timing constraints while the disjoint floorplanner could not meet the area constraints.

VI. CONCLUSIONS

In this paper, we proposed a new floorplanner to handle heterogeneous resources. Our floorplanner handles each resource type individually. It allows for different placements of different resources of a block. This leads to large savings in the area of the floorplans. Moreover, the floorplanner tends to better utilize the available area to optimize the internal wirelength of the blocks while improving the area feasibility of the overall floorplan. The experiments show that the proposed floorplanner can reduce the area by as much as 50% relative to a disjoint floorplanner. In real applications, the proposed floorplanner could meet the area and performance constraints where the disjoint floorplanner could not find a feasible floorplan. The runtime of the floorplanner is also low: on average 1.4 times that of the traditional floorplanner. For future work, we intend to use this floorplanner to handle multiple objectives.

REFERENCES

[1] M. Wang, A. Ranjan, and S. Raje, “Multi-million gate FPGA physical design challenges,” in Proc. IEEE International Conference on Computer Aided Design (ICCAD), San Jose, California, 2003, pp. 891–898.
[2] L. Cheng and M. D. F. Wong, “Floorplan design for multi-million gate FPGAs,” in Proc. IEEE International Conference on Computer Aided Design (ICCAD), San Jose, California, 2004, pp. 292–299.
[3] J. Yuan, S. Dong, X. Hong, and Y. Wu, “LFF algorithm for heterogeneous FPGA floorplanning,” in Proc. IEEE Conference on Asia and South Pacific Design Automation, 2005, pp. 1123–1126.
[4] Y. Feng and D. P. Mehta, “Heterogeneous floorplanning for FPGAs,” in Proc. IEEE International Conference on VLSI Design (VLSID), 2006, pp. 257–262.
[5] V. Betz and J. Rose, “VPR: A new packing, placement and routing tool for FPGA research,” in International Workshop on Field Programmable Logic and Applications, 1997.
[6] P. H. Shiu, R. Ravichandran, S. Easwar, and S. K. Lim, “Multi-layer floorplanning for reliable system-on-package,” in Proc. IEEE International Symposium on Circuits and Systems (ISCAS), 2004, pp. V69–V72.
[7] B. Hu, “Timing-driven placement for heterogeneous field programmable gate arrays,” in Proc. IEEE International Conference on Computer Aided Design (ICCAD), Nov. 2006, pp. 383–388.
[8] S. N. Adya and I. L. Markov, “Fixed-outline floorplanning through better local search,” in Proc. IEEE International Conference on Computer Design (ICCD), Austin, 2001, pp. 328–333.
[9] S. N. Adya and I. L. Markov, “Fixed-outline floorplanning: Enabling hierarchical design,” IEEE Trans. VLSI Syst., vol. 11, no. 6, pp. 1120–1135, Dec. 2003.
