Device Selection for System Partitioning*
Ulrich Weinmann¹, Oliver Bringmann¹ and Wolfgang Rosenstiel²
¹ Computer Science Research Center at the University of Karlsruhe (FZI), Haid-und-Neu-Str. 10-14, 76131 Karlsruhe, Germany
² FZI and University of Tübingen, Sand 13, 72076 Tübingen, Germany

Abstract
This paper presents a new approach to the problem of selecting suitable target technologies and devices for complex circuits. The implemented algorithm is based on formulating partitioning and device selection as a general facility location problem. The complexity of this NP-hard problem is reduced by transferring the constraints of the function into a dual lower bound function using Lagrangian relaxation and subgradient optimization. The evaluated sample circuits, generated by High-Level Synthesis tools, show cost-minimized implementations obtained with low computing times.
1 Introduction
The partitioning of logic structures has become a very important topic for system implementations. In particular, the problem of realizing large circuits with programmable or standard devices has not yet been extensively investigated with respect to cost minimization, timing optimization or optimal use of area. One example is the available partitioning tools for rapid prototyping systems, which use only one programmable architecture. For larger circuits the area utilization per device is rather low, and timing problems arise from the high number of interconnections between chips [9]. In practice the designer today has to partition the circuit manually and decide on the best suited target technology. This is often not feasible because of the large amount of device- and circuit-specific data that has to be considered. The problem of partitioning circuits into different target architectures, including an automatic device selection, is therefore not yet solved.

Many approaches in this field focus on optimizing the partitioning problem, including area, timing and interconnection minimization. However, no polynomial-time algorithm is known that solves this problem exactly. Clustering-based partitioning can be divided into seed selection, unplaced node selection and node placement. This approach is easy to implement, but depends strongly on the selection of the seed nodes. Besides simulated annealing [7], which uses an analogy to the annealing process in physics to avoid local minima, Min-Cut two-way partitioning [6] is very efficient due to group swapping. These basic partitioning algorithms have since been improved, mostly by reducing their computational complexity to obtain partitioning results faster. In 1993 Kužnar, Brglez and Kozminski [8] first showed the need to relate partitioning to technology mapping, using an improved Min-Cut approach. There the circuit is mapped into one architecture, and the resulting logic blocks can be distributed to different devices of this architecture while minimizing a cost function. Different target architectures, or functional preferences of circuits for specific devices, are not considered.

* This research is supported by the ESPRIT project 6855 "LINK".

2 Overview of the Device Selection Approach
This paper proposes an approach for automatic device selection and partitioning into different architectures. The approach respects functional dependencies between the circuit and the target devices in order to achieve optimized implementation results. Figure 1 shows a general overview. The system starts with a pre-partitioned netlist, which can be derived from hierarchical clusters (generated by High-Level Synthesis) or from a clustering step of the partitioning tool. From this module description the device selection tool generates a cost function that includes all user constraints and the available library data, covering programmable (FPGA/PLD) and standard devices. A technology mapping tool is used to calculate implementation data such as area or delay for each technology.

The device selection step then solves two implementation problems that depend heavily on each other. On the one hand, a suitable target technology has to be found for each cluster. On the other hand, each cluster can itself be partitioned, and the resulting subclusters have to be assigned to a target technology. The optimization problem is therefore the question of which portion of a cluster to implement in which target technology. Section 3.3 shows that both goals can be combined into one function. The system includes a technology mapping for universal architectures [11,12] and a maximum flow based partitioning approach [13].

After the cost function has been minimized, a board layout with devices and interconnections is generated. The design can be implemented directly, because all partitions are already mapped into the corresponding devices. For the first time, standard devices can also be included automatically in the design process; they are especially useful for implementations with minimized overall cost. Mask-programmed ASICs can be used if the user constraints cannot be met by programmable logic.
[Figure 1: Overview of the System. Inputs: user constraints, device/module library and hierarchical netlist. Tools: device selection, partitioning tool, technology mapping tool, placement & routing. Implementation targets: FPGA, (C)PLD, standard devices, ASIC.]
3 Device Selection and Partitioning
The selection of suitable target technologies for a circuit is the most important step during system partitioning. This device selection can be divided into three steps. In the first step the user defines the primary optimization goal of the implementation and any special constraints that have to be considered. In the second step the system evaluates the circuit and the user constraints and restricts the optimization step accordingly (reduction of the device library etc.). In the last step the actual implementation is computed.

(1) User constraints and optimization goal
(2) Evaluation of circuit and constraints
(3) Cost function and mixed integer linear optimization

Figure 2 shows a more detailed overview of these steps. After constraint specification (1) and netlist analysis (2), which includes the technology mapping, a cost function is generated from all available data. Minimizing this cost function yields an allocation of clusters, or portions of clusters, to target devices. Once all clusters are mapped into these target technologies, the clusters have to be partitioned according to the results of the cost function. If this is possible and all devices can be routed, the system can be implemented. In the following, the first three steps are discussed in more detail, with the main emphasis on the device selection problem.

[Figure 2: Optimization Flow. (1) Specification of the primary optimization goal and reduction of the device library according to user constraints; (2) netlist analysis and cost calculation / estimation; (3) device selection and partitioning. Decision nodes: all clusters mapped? (if not: technology mapping of not yet mapped clusters); partitioning possible? (if not: raising of the costs of not partitioned clusters); capacity of device exceeded? (if so: reduction of the capacity allowed for the device); otherwise: implementation.]

3.1 Constraints and Optimization Goal
The most important influence comes from the user constraints, which are divided into two categories. The system constraints define the overall optimization goal for the partitioning (see Table 1). The first goal minimizes the total cost of an implementation: the cheapest realisation is chosen using the device costs of the library. Minimizing the number of devices yields the implementation with the lowest number of chips. To optimize the system speed, critical paths are clustered, and the wires on a board layout are minimized by reducing the number of interconnections between the chips (see partitioning tool [13]).

Table 1: User Constraints

  #   System Constraints    Device Constraints
  1   Overall costs         Reprogrammability
  2   Number of devices     Testability (e.g. Scan Path)
  3   Speed                 Power consumption (3V)
  4   Wires on the board    Range of temperature (mil)
  5   --                    Technology (e.g. Anti-Fuse, SRAM, (E)EPROM)
  6   --                    Reliability
The device constraints are user-specific adjustments that express preferences for particular devices. In contrast to the system constraints, they do not influence the partitioning or technology mapping optimization steps. These constraints help to reduce the library of possible target technologies and thereby also the complexity of the selection problem. If, for example, the user restricts the power consumption of a device, the target library is reduced to the devices that meet this constraint.
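The library reduction can be illustrated with a minimal sketch. The `Device` record, its attribute names and the `reduce_library` helper are assumptions made for this illustration and do not describe the implemented tool; costs and areas are taken from Table 3, and the technology attributes reflect the SRAM-based Xilinx and antifuse-based Actel families.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Device:
    name: str
    cost: float           # F_j, normalized device cost
    area: int             # K_j, total area of the device
    reprogrammable: bool  # hypothetical attribute for the device constraint
    technology: str       # e.g. "SRAM" or "Anti-Fuse"

def reduce_library(library, *, reprogrammable=None, technologies=None):
    """Keep only the devices that satisfy the given device constraints."""
    kept = []
    for dev in library:
        if reprogrammable is not None and dev.reprogrammable != reprogrammable:
            continue
        if technologies is not None and dev.technology not in technologies:
            continue
        kept.append(dev)
    return kept

library = [
    Device("XC2064", 1.0, 64, True, "SRAM"),
    Device("XC3042", 4.6, 144, True, "SRAM"),
    Device("A1020", 3.8, 547, False, "Anti-Fuse"),
]

# A user who requires reprogrammability removes the antifuse device.
reduced = reduce_library(library, reprogrammable=True)
print([dev.name for dev in reduced])
```

Each constraint only removes devices, so the selection problem of Section 3.3 is solved over a smaller set D.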
3.2 Evaluation of the Circuit
During this phase all clusters of the netlist are examined for their use of combinational gates, flipflops, edges or pins. Furthermore, special circuit attributes, such as two-, four- or multi-level descriptions, are evaluated.

Table 2: Cluster Attributes and Device Preferences

  Cluster Attributes                       Preference for Device
  High use of storage elements             Architectures with predefined storage elements
  High proportion of combinational gates   Architectures without predefined storage elements
  Datapath implementations                 FPGAs
  Two/four-level circuits (e.g. FSM)       (C)PLDs
  Time critical clusters                   Fast devices, (C)PLDs
  Standard modules (Add, Mult)             Standard devices
Using this information, preferences can be derived for partitions that can be implemented efficiently in particular devices. The device library records how well such circuit attributes suit particular technologies (see Table 2). Standard module implementations or user experience can also be added to the library. The device selection step treats these dependencies only as preferences; the target devices finally selected can differ if this lowers the overall cost function.

This cost function can be calculated from different implementation factors. Besides the preference factor already discussed, the area, delay and total cost of the implementation can be weighted. The following algorithm (Figure 3) shows the strategy for calculating the cost of implementing every cluster of a circuit in every device of the library. The first term of the cost function considers the delay; its weight can be controlled by the user constraints. The second term combines area consumption and device cost and equals the cost of a cluster mapped into a particular device. The third term reflects the preferences. This function covers the basic implementation goals; further terms can be added.

The computation speed of this step can be adjusted by using estimators instead of the actual implementation tools. Experimental results showed that even large numbers of clusters can be mapped at low computational cost because of their relatively small size. The actually time-consuming problem is the device selection, or cluster distribution, problem that has to be solved.
    for all clusters i
        calculate area consumption (e.g. gate count)
        if area > a_max then pre-partitioning of the cluster i
        for all devices j
            if equivalent module exists in module library then
                take implementation costs c_ij from module library
            else
                calculate / estimate the costs a_ij, t_ij, p_ij
                calculate implementation costs c_ij of cluster i in device j
            end if
        end for
    end for

with the implementation costs

$$ c_{ij} = w_t \cdot t_{ij} + w_a \cdot \frac{F_j \cdot a_{ij}}{K_j} + w_p \cdot p_{ij} $$

where:
$c_{ij}$ = costs to implement cluster i in device j,
$w_t, w_a, w_p$ = weight factors to adjust costs for implementation and devices,
$t_{ij}$ = maximal delay of cluster i in device j,
$F_j$ = costs of device j,
$a_{ij}$ = use of area of cluster i in device j,
$K_j$ = total area of device j,
$p_{ij}$ = preference factor for implementing cluster i in device j,
$a_{max}$ = predefined upper bound for cluster sizes.

Figure 3: Cost Calculation / Estimation
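The loop of Figure 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implemented tool: the function names, the `estimate` callback and the dictionary layout are hypothetical, the pre-partitioning of oversized clusters is omitted, and delay and preference are set to zero in the example so that only the area term remains.

```python
def implementation_cost(t_ij, a_ij, p_ij, F_j, K_j, w_t=1.0, w_a=1.0, w_p=1.0):
    """c_ij = w_t*t_ij + w_a*(F_j*a_ij/K_j) + w_p*p_ij."""
    return w_t * t_ij + w_a * (F_j * a_ij / K_j) + w_p * p_ij

def cost_matrix(clusters, devices, estimate, module_library=None, **weights):
    """Return {(cluster, device): c_ij} for every cluster/device pair.

    devices:        {name: (F_j, K_j)}
    estimate:       hypothetical callback (cluster, device) -> (t_ij, a_ij, p_ij)
    module_library: optional {(cluster, device): c_ij} of known implementations
    """
    module_library = module_library or {}
    costs = {}
    for i in clusters:
        for j, (F_j, K_j) in devices.items():
            if (i, j) in module_library:
                # an equivalent module exists: take its stored cost
                costs[i, j] = module_library[i, j]
            else:
                t_ij, a_ij, p_ij = estimate(i, j)
                costs[i, j] = implementation_cost(t_ij, a_ij, p_ij, F_j, K_j, **weights)
    return costs

# Device data from Table 3, logic block counts of alu_cmp_0 from Table 5.
devices = {"XC2018": (1.5, 100), "A1020": (3.8, 547)}
areas = {("alu_cmp_0", "XC2018"): 16, ("alu_cmp_0", "A1020"): 62}

def estimate(i, j):
    return 0.0, areas[i, j], 0.0  # delay and preference ignored here

costs = cost_matrix(["alu_cmp_0"], devices, estimate)
print(costs)
```

With these numbers the area term charges the cluster its share of the device cost, so the comparator is cheaper in the XC2018 than in the A1020.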
3.3 Linear Optimization of the Distribution Problem
The goal of this optimization step is an optimal selection and allocation of the clusters in a netlist to one or more suitable target devices. Figure 4 depicts the situation. The preselection step of Section 3.1 reduces the number of devices, while the partitioning or hierarchical division of the system description produces the clusters i. Each cluster, or subsets of these clusters, has to be assigned to suitable devices.
[Figure 4: Partitioning and Selection Problem. The clusters i of the netlist are assigned with costs c_ij to devices j with costs F_j; the devices are taken from the reduced device library.]
In order to give a mathematical description of this problem, we use a mixed integer programming formulation. In general, the problem of distributing partitions to one or more target devices can be adapted from the facility (or plant, warehouse) location problem (FLP). There the goal is to minimize the costs of opening new warehouses plus the costs of transporting goods from the warehouses to the customers. With given customer locations, the locations of the warehouses must be selected so as to minimize the transport costs. The solution to this problem gives the optimal locations of the facilities that satisfy all customers' demands. This problem formulation can be transferred to the device selection problem, in which different circuit partitions have to be distributed to several devices under special cost criteria. Following the formulation in [1], the problem can be adapted and stated as follows: Minimize:
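$$ Z = \sum_{i \in C} \sum_{j \in D} c_{ij} \cdot x_{ij} + w_d \cdot \sum_{j \in D} F_j \cdot y_j \qquad \text{(A)} $$

subject to

$$ \sum_{i \in C} a_{ij} \cdot x_{ij} \le K_j \cdot A_j \cdot y_j, \quad \forall j \in D, \qquad \text{(B)} $$

$$ \sum_{j \in D} x_{ij} = 1, \quad \forall i \in C, \qquad \text{(C)} $$

$$ 0 \le x_{ij} \le 1, \quad y_j \in \mathbb{N}_0, \quad \forall i \in C,\ j \in D, \qquad \text{(D)} $$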
where
$c_{ij}$ = costs to implement cluster i in device j,
$x_{ij}$ = portion of area of cluster i in device j,
$y_j$ = number of devices of type j,
$F_j$ = costs of device j,
$a_{ij}$ = use of area of cluster i in device j,
$K_j$ = total area of device j,
$A_j$ = proportion of $K_j$ that should be used ($0 \le A_j \le 1$),
$w_d$ = weight to adjust costs for implementation and devices,
C = clusters, C = {1,…,m},
D = devices, D = {1,…,n}.

This function can also be formulated as a mixed integer linear program like the capacitated facility location problem [1]. In this formulation, j indexes the devices or chips available for the realisation, and i the clusters of the circuit to be implemented in the devices j. $c_{ij}$ is the cost of implementing a cluster in device j, while $x_{ij}$ allows a portion of a cluster to be implemented (this produces the input to the partitioning tool). The product $F_j \cdot y_j$ adds the device cost to the formula. Constraints (B) and (C) are very important: (B) ensures that the implementation does not grow larger than the capacity of a device, while (C) ensures that all portions of a cluster are allocated. Further constraints, such as limits on the device or I/O count, can be added without changing the complexity [4]. Constraint (D) restricts the number of devices to integers.

A solution to this minimization problem can be found in different ways. For smaller problems (e.g. n, m < 10) an exhaustive search can evaluate all $n^m$ valid solutions and find the optimum. Using an ILP solver with a branch-and-bound method is not advisable because of the large number of decision variables in real problems. For implementations with more partitions or possible target devices, the problem therefore has to be split up and transferred into subproblems.
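For very small instances the exhaustive search can be sketched as follows. The sketch makes a simplifying assumption that is not part of formulation (A) to (D): every cluster is assigned as a whole to one device type ($x_{ij} \in \{0, 1\}$), which yields exactly the $n^m$ candidates mentioned above. The function name and the data layout are hypothetical; the example uses logic block counts from Table 5 both as $c_{ij}$ and as $a_{ij}$, and device data from Table 3.

```python
import itertools
import math

def exhaustive_search(c, a, F, K, A=1.0, w_d=1.0):
    """Enumerate all n**m whole-cluster assignments and return the cheapest.

    c[i][j], a[i][j]: cost and area of cluster i in device type j
    F[j], K[j]:       cost and total area of device type j
    Returns (Z, assignment, y); assignment[i] is the device type of cluster i.
    """
    m, n = len(c), len(F)
    best = (math.inf, None, None)
    for assign in itertools.product(range(n), repeat=m):
        load = [0.0] * n
        for i, j in enumerate(assign):
            load[j] += a[i][j]
        # smallest integer y_j that satisfies capacity constraint (B)
        y = [math.ceil(load[j] / (K[j] * A)) for j in range(n)]
        z = sum(c[i][j] for i, j in enumerate(assign))
        z += w_d * sum(F[j] * y[j] for j in range(n))
        if z < best[0]:
            best = (z, assign, y)
    return best

# Clusters alu_cmp_0, control, mux2 in device types XC2018 and A1020.
c = [[16, 62], [98, 118], [4, 8]]
a = c
F, K = [1.5, 3.8], [100, 547]
z, assign, y = exhaustive_search(c, a, F, K)
print(z, assign, y)
```

The number of candidates grows as $n^m$, so this is only usable for a handful of clusters and devices; it mainly serves as a reference for checking the bounds of Section 3.4 on toy instances.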
3.4 The Lagrangian Approach with Subgradient Optimization
The high complexity of the minimization problem stems not only from the function itself, but also from the corresponding constraints. Various lower bounds on the function Z (see (A)) can be obtained by relaxing subsets of the constraints (B), (C) or (D), either completely or in a Lagrangian fashion [3,10]. The corresponding Lagrangian dual bound is found by relaxing constraint (C) and adding it to the objective function. This constraint, which demands the full allocation of all clusters to devices, is chosen for the relaxation because relaxing other constraints leads to higher complexity (for a proof see [4]). The Lagrangian multiplier vector $\lambda \in \mathbb{R}^m$ is attached to the objective function, and the result gives a lower bound on the optimal solution of the original problem. Minimize:

$$ LR = \sum_{i \in C} \sum_{j \in D} c_{ij} \cdot x_{ij} + w_d \cdot \sum_{j \in D} F_j \cdot y_j + \sum_{i \in C} \lambda_i \cdot \Big( 1 - \sum_{j \in D} x_{ij} \Big) $$

$$ \phantom{LR} = \sum_{j \in D} \sum_{i \in C} ( c_{ij} - \lambda_i ) \cdot x_{ij} + w_d \cdot \sum_{j \in D} F_j \cdot y_j + \sum_{i \in C} \lambda_i \qquad \text{(E)} $$

subject to

$$ \sum_{i \in C} a_{ij} \cdot x_{ij} \le K_j \cdot A_j \cdot y_j, \quad \forall j \in D, \qquad \text{(F)} $$

$$ x_{ij} \le 1, \quad \forall i \in C,\ j \in D, \qquad \text{(G)} $$

$$ x_{ij} \ge 0, \quad y_j \in \mathbb{N}_0, \quad \forall i \in C,\ j \in D. \qquad \text{(H)} $$

In order to maximize the lower bound, the following Lagrangian dual problem has to be solved:

$$ \max_{\lambda \in \mathbb{R}^m} LR_{\min}(\lambda) = \max_{\lambda \in \mathbb{R}^m} \Big\{ \min_{x, y} \big\{ LR(x, y, \lambda) \big\} \Big\} \qquad \text{(I)} $$
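For a fixed multiplier vector the relaxed problem (E) to (H) separates by device type, which the following sketch exploits. It is one possible greedy scheme and not necessarily the one implemented by the authors: for each candidate count $y_j$ the clusters with negative reduced cost $c_{ij} - \lambda_i$ are packed like a fractional knapsack, ordered by reduced cost per unit of area. Function names and the toy data (the same three clusters and two device types as in the exhaustive search example, with arbitrarily chosen multipliers) are assumptions for illustration.

```python
import math

def device_subproblem(reduced, area, F_j, cap, w_d=1.0):
    """Relaxed subproblem for one device type j (assumes all area[i] > 0).

    reduced[i] = c_ij - lambda_i, area[i] = a_ij, cap = K_j * A_j.
    Returns (objective contribution, x column, y_j).
    """
    m = len(reduced)
    # Only clusters with negative reduced cost can lower the objective;
    # take them in order of reduced cost per unit of area.
    order = sorted((i for i in range(m) if reduced[i] < 0),
                   key=lambda i: reduced[i] / area[i])
    y_max = math.ceil(sum(area[i] for i in order) / cap)
    best = (0.0, [0.0] * m, 0)      # y_j = 0: device type not used
    for y in range(1, y_max + 1):   # try every useful device count
        room, value, x = cap * y, w_d * F_j * y, [0.0] * m
        for i in order:
            if room <= 0:
                break
            x[i] = min(1.0, room / area[i])
            value += reduced[i] * x[i]
            room -= area[i] * x[i]
        if value < best[0]:
            best = (value, x, y)
    return best

def lagrangian_bound(c, a, F, K, lam, A=1.0, w_d=1.0):
    """Evaluate LR_min(lambda): sum of lambda plus all device subproblems."""
    m, n = len(c), len(F)
    bound, y = sum(lam), []
    x = [[0.0] * n for _ in range(m)]
    for j in range(n):
        reduced = [c[i][j] - lam[i] for i in range(m)]
        column = [a[i][j] for i in range(m)]
        value, x_col, y_j = device_subproblem(reduced, column, F[j], K[j] * A, w_d)
        bound += value
        y.append(y_j)
        for i in range(m):
            x[i][j] = x_col[i]
    return bound, x, y

c = [[16, 62], [98, 118], [4, 8]]
a = c
F, K = [1.5, 3.8], [100, 547]
bound, x, y = lagrangian_bound(c, a, F, K, lam=[20.0, 100.0, 5.0])
print(bound, y)
```

Sorting the clusters once per device type gives the O(m·n log m) behaviour for a fixed device count; the outer loop over $y_j$ is a simplification of this sketch. For the toy data the bound stays below the whole-cluster optimum of 121 found by enumeration, as a lower bound must.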
This Lagrangian lower bound provides the basis for an efficient solution of the minimization problem. The best Lagrangian multipliers λ can be found by subgradient optimization [10], which is applicable because $LR_{\min}$ is continuous and concave. This is an iterative procedure that, starting from an initial set, systematically generates further multipliers in order to maximize the lower bound of the function. For each multiplier vector λ, the device vector y and the partitions x can be calculated. Because constraint (C) is relaxed, the problem can now be solved optimally and independently for all devices by a greedy algorithm with the reduced complexity of O(m·n log m). Owing to the relaxed constraint (C), this solution is not necessarily feasible with respect to the cluster partitions x. A feasible solution x for the given devices y can therefore be found from function (A), which is now reduced to: Minimize

$$ Z_S = \sum_{i \in C} \sum_{j \in D_S} c_{ij} \cdot x_{ij} \qquad \text{(J)} $$

subject to

$$ \sum_{i \in C} a_{ij} \cdot x_{ij} \le K_j \cdot A_j \cdot y_j, \quad \forall j \in D_S, \qquad \text{(K)} $$

$$ \sum_{j \in D_S} x_{ij} = 1, \quad \forall i \in C, \qquad \text{(L)} $$

$$ x_{ij} \ge 0, \quad \forall i \in C,\ j \in D_S, \qquad \text{(M)} $$

where $D_S$ is the set of devices selected by y. This is a generalized transportation problem, which can be solved efficiently, for example by the network simplex method.
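One multiplier update of the subgradient procedure can be sketched as follows. The subgradient of $LR_{\min}$ at λ is the violation of the relaxed constraint (C), $g_i = 1 - \sum_j x_{ij}$. The step size rule used here, a factor θ times the gap between a known upper bound and the current lower bound divided by the squared norm of g, is a common textbook choice and an assumption of this sketch; the paper does not state which rule was implemented. All numbers in the example are illustrative.

```python
def subgradient_step(lam, x, lower, upper, theta=1.0):
    """Return updated multipliers and the subgradient g.

    lam:   current multipliers lambda_i
    x:     relaxed solution, x[i][j] = portion of cluster i in device j
    lower: current Lagrangian lower bound LR_min(lambda)
    upper: cost Z of the best feasible solution found so far
    """
    # violation of the relaxed allocation constraint (C) per cluster
    g = [1.0 - sum(row) for row in x]
    norm2 = sum(gi * gi for gi in g)
    if norm2 == 0.0:
        # constraint (C) holds for every cluster: nothing to update
        return list(lam), g
    step = theta * (upper - lower) / norm2
    return [l + step * gi for l, gi in zip(lam, g)], g

lam = [20.0, 100.0, 5.0]
x = [[1.0, 0.0], [0.5, 0.0], [1.0, 0.0]]   # cluster 2 only half allocated
new_lam, g = subgradient_step(lam, x, lower=119.0, upper=121.0)
print(new_lam, g)
```

A cluster that is under-allocated gets a larger multiplier, which lowers its reduced cost in the next iteration and makes the relaxed solution allocate more of it; over-allocated clusters are penalized in the same way.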
4 Experimental Results
We tested the device selection tool on several benchmark circuits. All displayed variables are used according to the definitions in formula (A). As a basis for the implementation, Table 3 shows a subset of the device library with the corresponding values for cost ($F_j$) and area capacity ($K_j$). The device costs are examples and are normalized to the smallest chip (2064). For the following examples the number of devices n equals 13.

Table 3: Subset of the Device Library D (n=13)

  Family   Archit.   #      costs F_j   area K_j
  Xilinx   XC2000    2064     1.0          64
                     2018     1.5         100
           XC3000    3042     4.6         144
                     3064     8.2         224
                     3090    13.0         320
                     3195    36.0         484
           XC4000    4005    18.3         196
                     4008    30.0         324
                     4010    40.8         400
  Actel    ACT 1     1010     2.9         295
                     1020     3.8         547
           ACT 2     1225     4.6         451
                     1240     5.0         684

The first examples of the device selection show the results for flat netlist descriptions (MCNC benchmarks). For these circuits a preclustering step [13] has to be performed in order to obtain a set C of clusters i, which can be distributed to different devices j of the library D. Preclustering is a commonly used partitioning step to reduce the complexity of a subsequent iterative optimization. Table 4 shows the number of gates and the number of clusters after the preclustering. The linear optimization of the cost function (A) is then able to calculate the device vector y. Because of the high number of clusters, the cluster distribution vector x is not shown in Table 4; the following example gives exact numbers for this variable. The right columns give a comparison with the results of Kužnar and Brglez [8], who used the Xilinx 3000 FPGA family for the partitioning; the integration of different architectures and technologies was not possible there. With the new device selection approach presented here, cost-minimized implementations across different architectures could be achieved.

Table 4: MCNC-Benchmarks Implementations

  Benchmark   # of Gates   # of Clusters m   Devices (number y_j)               Devices used by [8] (number)
  C3540          1669           22           ACT 1240 (1)                       XC3042 (3)
  C6288          2416           43           ACT 1020 (1), XC 3042 (1)          -
  s1494           653           37           XC2064 (1), ACT1010 (1)            -
  s1423           804           37           XC2018 (2)                         -
  s13207         7951           35           A1240 (5), XC2064 (1), A1010 (1)   XC3020 (3), XC3030 (5), XC3042 (4)
  s15850        10369           38           A1240 (6)                          -
  s35932        17793           53           ACT 1010 (1), ACT 1240 (14)        XC3030 (5), XC3042 (15), XC3064 (4), XC3090 (1)

As a hierarchical example we used a VHDL description of the 16 bit microprocessor DP16 from the VHDL-Cookbook [2]. This description was synthesized with CADDY [5], resulting in a hierarchical netlist description. This netlist included all functional units and their corresponding gate representation. The processor includes an ALU with adders, comparators, several buffers, registers and a control unit. The goal was a programmable FPGA implementation using the Actel and Xilinx families shown in Table 3. The optimization goal was the minimization of the total system implementation costs.
In a first step all clusters were mapped into the different technologies in order to calculate the costs for each cluster (see the left columns of Table 5). Afterwards the partitioning and selection problem could be solved. Columns 7, 8 and 9 show the chosen devices (XC2018, A1020, A1240) for a given area utilisation of 100% ($A_j = 1$). The percentages give the proportion of each cluster implemented in each device. The last row shows the percentage of area used on the chip (100% is the maximal chip capacity), which has to be reduced to 80% to ensure routability. The results of the second iteration, with the constraint of 80% area utilisation, are shown in the right columns. The three chosen XC2018 devices allow the corresponding clusters to be mapped into this architecture and subsequently partitioned. The same can be calculated for the Actel devices. This simple example shows that, out of 13 different Actel and Xilinx devices, three (five) cost-minimizing chips could be found for this implementation. The computation time for all examples of Tables 4 and 5, including several changes of user constraints, was below 15 seconds.
Table 5: Implementation of the DP16 Processor

Columns 2 to 6: # of logic blocks used for each device family (c_ij). Columns 7 to 11: result, proportion x_ij of cluster i in device j, for 100% area utilisation (A_j=1, columns 7 to 9) and 80% area utilisation (A_j=0.8, columns 10 and 11).

                 # of Logic Blocks c_ij                  |  100% area utilisation: A_j=1       |  80% area utilisation: A_j=0.8
  Cluster i      XC2000  XC3000  XC4000  Act 1  Act 2    |  1 x XC2018  1 x A1020  1 x A1240   |  3 x XC2018  2 x A1240
  alu_add_0        16      16      12      32     31     |      0%        100%        0%       |      0%        100%
  alu_addsub_0     32      32      24      65     65     |      0%        100%        0%       |    100%          0%
  alu_addsub_1     32      32      24      65     65     |      0%        100%        0%       |      0%        100%
  alu_cmp_0        16      16      12      62     32     |    100%          0%        0%       |    100%          0%
  alu_cmp_1        16      16      12      62     32     |    100%          0%        0%       |    100%          0%
  alu_cmp_2        16      16      12      62     32     |    100%          0%        0%       |    100%          0%
  alu_cmp_3        16      16      12      62     32     |    100%          0%        0%       |    100%          0%
  alu_sub_0        16      16      12      34     21     |      0%          0%      100%       |      0%        100%
  buffer16_0       16      16       8      32     32     |      0%        100%        0%       |    100%          0%
  buffer16_1       16      16       8      32     32     |      0%        100%        0%       |    100%          0%
  cc_comp           2       2       2       4      4     |    100%          0%        0%       |      0%        100%
  control          98      82      62     118    108     |      0%        100%        0%       |      0%        100%
  latch16_0        16      16       8      32     32     |     26%         74%        0%       |      0%        100%
  latch16_1        16      16       8      32     32     |    100%          0%        0%       |    100%          0%
  latch16_2        16      16       8      32     32     |      0%        100%        0%       |      0%        100%
  latch3            3       3       2       6      6     |    100%          0%        0%       |      0%        100%
  latch_buf16      48      32      32      81     65     |      0%         83%       17%       |      0%        100%
  mux2              4       4       2       8      8     |    100%          0%        0%       |    100%          0%
  pc_reg           71      60      48      80     80     |      0%        100%        0%       |      0%        100%
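The capacity constraint (B) can be checked against the numbers of Table 5. The short sketch below sums the logic blocks placed in the single A1020 device of the 100% solution. It assumes that the ACT 1 logic block counts of Table 5 serve as the area values $a_{ij}$ for the A1020 and that the capacity is the value $K_j = 547$ of Table 3; the proportions are the rounded percentages printed in the table.

```python
# (cluster, logic blocks in the ACT 1 family, proportion x_ij placed in the A1020)
a1020_rows = [
    ("alu_add_0",     32, 1.00),
    ("alu_addsub_0",  65, 1.00),
    ("alu_addsub_1",  65, 1.00),
    ("buffer16_0",    32, 1.00),
    ("buffer16_1",    32, 1.00),
    ("control",      118, 1.00),
    ("latch16_0",     32, 0.74),
    ("latch16_2",     32, 1.00),
    ("latch_buf16",   81, 0.83),
    ("pc_reg",        80, 1.00),
]

K_A1020 = 547  # area capacity K_j of the A1020 (Table 3)

# left-hand side of constraint (B) for this device with A_j = 1, y_j = 1
used = sum(blocks * share for _, blocks, share in a1020_rows)
utilisation = used / K_A1020
print(f"{used:.2f} of {K_A1020} logic blocks, utilisation {utilisation:.1%}")
```

Under these assumptions the device is filled almost completely, which is consistent with the remark that the permitted utilisation has to be lowered to 80% to ensure routability.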
5 Summary and Conclusions
This paper presented a new approach to the problem of selecting suitable target technologies and devices for complex circuits. Owing to the evaluation steps, the approach is able to handle very large circuits, and the cost function allows different user constraints to be considered during the optimization. The partitioning and device selection step could be formulated as a general facility location problem, which has to be minimized. By Lagrangian relaxation the constraints of the function could be transferred into a dual lower bound function, which could be solved efficiently. The examples showed the optimization of actual implementation problems, and good results could be found for different user constraints. Future work will concentrate on extending the device/module library and adding interfaces to further synthesis tools.
6 References
[1] C.H. Aikens, “Facility location models for distribution planning”, European Journal of Operational Research 22, pp. 263-279, 1985.
[2] P. Ashenden, “The VHDL-Cookbook”, Department of Computer Science, University of Adelaide, 1990.
[3] J. Beasley, “Lagrangean heuristics for location problems”, European Journal of Operational Research 65, pp. 383-399, 1993.
[4] G. Cornuejols, R. Sridharan and J. Thizy, “A comparison of heuristics and relaxations for the Capacitated Plant Location Problem”, European Journal of Operational Research 50, pp. 280-297, 1991.
[5] P. Gutberlet, W. Rosenstiel, “Timing Preserving Interface Transformations for the Synthesis of Behavioural VHDL”, Proceedings of the EURO-DAC 94, Grenoble, 1994.
[6] B.W. Kernighan, S. Lin, “An Efficient Heuristic Procedure for Partitioning Graphs”, Bell System Technical Journal, vol. 49, no. 2, 1970.
[7] S. Kirkpatrick, C. Gelatt and M. Vecchi, “Optimization by Simulated Annealing”, Science, vol. 220, no. 4598, pp. 671-680, 1983.
[8] R. Kužnar, F. Brglez and K. Kozminski, “Cost Minimization of Partitions into Multiple Devices”, Proceedings of DAC 93, pp. 315-320, 1993.
[9] H. Owen, U. Khan, J. Hughes, “FPGA-based Emulator Architectures”, Int. Workshop on Field Programmable Logic and Applications, Oxford, Sept. 1993.
[10] C. Reeves, “Modern Heuristic Techniques for Combinatorial Problems”, Blackwell Scientific Publications, Oxford, 1993, pp. 266-276.
[11] U. Weinmann, W. Rosenstiel, “Technology Mapping for Sequential Circuits based on Retiming Techniques”, Proceedings of the EURO-DAC 93, Hamburg, 1993.
[12] U. Weinmann, W. Rosenstiel, “Universal Technology Mapping for Table-Lookup FPGAs”, IFIP Workshop on Logic and Architecture Synthesis, Grenoble, 1993.
[13] U. Weinmann, W. Rosenstiel, “Network Flow based Clustering and Partitioning for FPGAs”, IFIP Workshop on Logic and Architecture Synthesis, Grenoble, Dec. 1994.