Device Selection for System Partitioning*

Ulrich Weinmann (1), Oliver Bringmann (1) and Wolfgang Rosenstiel (2)

(1) Computer Science Research Center at the University of Karlsruhe (FZI), Haid-und-Neu-Str. 10-14, 76131 Karlsruhe, Germany
(2) FZI and University of Tübingen, Sand 13, 72076 Tübingen, Germany

Abstract

This paper presents a new approach to the problem of selecting suitable target technologies and devices for complex circuits. The implemented algorithm formulates partitioning and device selection as a general facility location problem. The complexity of this NP-hard problem is reduced by moving the constraints of the function into a dual lower bound function using Lagrangian relaxation and subgradient optimization. The evaluated sample circuits, generated by High-Level Synthesis tools, show cost-minimized implementations obtained with low computing times.

1 Introduction

The partitioning of logic structures has become a very important topic for system implementations. In particular, the problem of realizing large circuits with programmable or standard devices has not yet been extensively investigated with respect to cost minimization, timing optimization or optimal use of area. One example is the available partitioning tools for rapid prototyping systems, which use only one programmable architecture. For larger circuits the area utilisation per device is rather low, and the timing suffers from the high number of interconnections between chips [9]. In practice, the designer today has to partition the circuit manually and decide on the best suited target technology. This is often not possible because of the large amount of device and circuit specific data that has to be considered. The problem of partitioning circuits into different target architectures with automatic device selection is therefore not yet solved.

Many approaches in this field focus on optimizing the partitioning problem, including area, timing and interconnection minimization. However, no polynomial-time algorithm is known that solves this problem exactly. Clustering-based partitioning can be divided into seed selection, unplaced node selection and node placement. This approach is easy to implement, but depends strongly on the selection of the seed nodes. Besides simulated annealing [7], which uses an analogy to the annealing process in physics to avoid local minima, Min-Cut two-way partitioning [6] is very efficient due to group swapping. These basic partitioning algorithms have been improved in the past, mostly by reducing the computational complexity in order to obtain partitioning results faster.

In 1993 Kužnar, Brglez and Kozminski [8] first showed the need to relate partitioning and technology mapping, using an improved Min-Cut approach. There the circuit is mapped into one architecture, and the resulting logic blocks can be distributed to different devices of this architecture while minimizing a cost function. Different target architectures or functional preferences of circuits for specific devices are not considered.

* This research is supported by the ESPRIT project 6855 "LINK".

2 Overview of the Device Selection Approach

This paper proposes an automatic device selection and partitioning approach for different architectures. The approach respects functional dependencies between the circuit and the target devices in order to achieve optimized implementation results. Figure 1 shows a general overview. The system starts with a pre-partitioned netlist, which can be derived from hierarchical clusters (generated by High-Level Synthesis) or from a clustering step of the partitioning tool. Using this module description, the device selection tool generates a cost function that includes all user constraints and the available library data, which cover programmable (FPGA/PLD) and standard devices. A technology mapping tool is used to calculate implementation data such as area or delay for each technology.

The device selection step then solves two implementation problems that depend heavily on each other. On the one hand, a suitable target technology has to be found for each cluster. On the other hand, each cluster itself can be partitioned, and the resulting subclusters have to be assigned to a target technology. The optimization problem is therefore the question of which portion of a cluster to implement in which target technology. Section 3.3 shows that both goals can be combined in one function. The system includes a technology mapping for universal architectures [11,12] and a maximum flow based partitioning approach [13].

After the cost function has been minimized, a board layout with devices and interconnections is generated. The design can be implemented directly, because all partitions are already mapped into the corresponding devices. For the first time, standard devices can also be included automatically in the design process; they are especially useful for implementations with minimized overall cost. Mask programmed ASICs can be used if the user constraints cannot be met by programmable logic.

[Figure 1: Overview of the System. Inputs: user constraints, device/module library, hierarchical netlist. Tools: device selection with constraints calculation, partitioning tool, technology mapping tool, placement & routing. Implementation targets: FPGA, (C)PLD, standard devices, ASIC.]

3 Device Selection and Partitioning

The selection of suitable target technologies for a circuit is the most important step during system partitioning. This device selection can be divided into three steps. In the first step the user defines the primary optimization goal of the implementation and any special constraints that have to be considered. In the second step the system evaluates the circuit and the user constraints and puts limitations on the optimization step (reduction of the device library etc.). In the last step the actual implementation is computed.

(1) User constraints and optimization goal
(2) Evaluation of circuit and constraints
(3) Cost function and mixed integer linear optimization

Figure 2 shows a more detailed overview of these steps. After constraint specification (1) and netlist analysis (2), which includes the technology mapping, a cost function is generated from all available data. Minimizing this cost function yields an allocation of clusters, or portions of clusters, to target devices. If all clusters are mapped into these target technologies, the clusters have to be partitioned according to the results of the cost function. If this is possible and all devices can be routed, the system can be implemented. In the following, the first three steps are discussed in more detail, with the main emphasis on the device selection problem.

[Figure 2: Optimization Flow. Boxes: (1) specification of the primary optimization goal, reduction of the device library according to user constraints; (2) analyse netlist & cost calculation / estimation; (3) device selection and partitioning; technology mapping of not yet mapped clusters; partitioning according to the selected devices; raising of the costs of not partitioned clusters; placement & routing; reduction of the capacity allowed for the device; implementation. Decisions: all clusters mapped? partitioning possible? capacity of device exceeded?]

3.1 Constraints and Optimization Goal

The most important influence derives from the user constraints, which are divided into two categories (see Table 1). The system constraints define the overall optimization goal for the partitioning. Minimizing the overall costs selects the cheapest realisation, using the device costs of the library. Minimizing the number of devices results in the implementation with the lowest number of chips. To optimize the system speed, critical paths are clustered, and the wires on the board are minimized by reducing the number of interconnections between the chips (see partitioning tool [13]).

Table 1: User Constraints

# | System Constraints | Device Constraints
1 | Overall costs | Reprogrammability
2 | Number of devices | Testability (e.g. Scan Path)
3 | Speed | Power consumption (3V)
4 | Wires on the board | Range of temperature (mil)
5 | -- | Technology (e.g. Anti-Fuse, SRAM, (E)EPROM)
6 | -- | Reliability

The device constraints are user specific adjustments that put preferences on special devices. In contrast to the system constraints, they do not influence the partitioning or technology mapping optimization steps. These constraints help to reduce the library of possible target technologies and thereby also the complexity of the selection problem. If, for example, the user restricts the power consumption of a device, the target library is reduced to the devices that meet this constraint.
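As a minimal sketch of this library reduction, the following Python fragment filters a device list by one device constraint. The record layout and the names `Device`, `technology` and `reduce_library` are hypothetical and not part of the described tool. The cost and area figures are taken from Table 3; the technology labels reflect that the Xilinx families are SRAM based while the Actel ACT families use anti-fuse programming.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str        # device type, e.g. "XC2018"
    cost: float      # normalized device cost F_j
    area: int        # total area K_j
    technology: str  # programming technology (illustrative attribute)


# Small excerpt of a device library; costs and areas follow Table 3.
LIBRARY = [
    Device("XC2064", 1.0, 64, "SRAM"),
    Device("XC2018", 1.5, 100, "SRAM"),
    Device("XC3042", 4.6, 144, "SRAM"),
    Device("A1020", 3.8, 547, "Anti-Fuse"),
    Device("A1240", 5.0, 684, "Anti-Fuse"),
]


def reduce_library(library, allowed_technologies):
    """Keep only the devices whose technology meets the device constraint."""
    return [d for d in library if d.technology in allowed_technologies]


# A user who requires reprogrammability rules out the anti-fuse parts.
reduced = reduce_library(LIBRARY, {"SRAM"})
print([d.name for d in reduced])
```

Other device constraints of Table 1, such as power consumption or temperature range, could be handled the same way with further attributes and predicates.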

3.2 Evaluation of the Circuit

During this phase all clusters of the netlist are examined for their use of combinational gates, flipflops, edges or pins. Furthermore, special circuit attributes, such as two-, four- or multi-level descriptions, are evaluated. Using this information, preferences can be derived for partitions that can be implemented efficiently in special devices. The device library includes information on the suitability of special technologies for such circuit attributes, as shown in Table 2.

Table 2: Cluster Attributes and Device Preferences

Cluster Attributes | Preference for Device
High use of storage elements | Architectures with predefined storage elements
High proportion of combinational gates | Architectures without predefined storage elements
Datapath implementations | FPGAs
Two/four-level circuits (e.g. FSM) | (C)PLDs
Time critical clusters | Fast devices, (C)PLDs
Standard modules (Add, Mult) | Standard devices

It should be mentioned that standard module implementations or user experience can be added to the library. In the device selection step these preferences are only taken into account; the resulting target devices can differ from them in order to minimize the overall cost function.

This cost function can be calculated using different implementation factors. Besides the preference factor already discussed, the area, delay and total cost of the implementation can be weighted. The following algorithm (Figure 3) shows the strategy for calculating the cost of implementing every cluster of a circuit in every device of the library. The first term of the cost function considers the delay; its weight can be controlled by the user constraints. The second term combines area consumption and device cost, which equals the cost of a cluster being mapped into a particular device. The third term reflects the preferences. This function covers the basic implementation goal; further factors can be added.

The computation speed of this step can be adjusted by using estimators instead of actual implementation tools. Experimental results showed that even larger numbers of clusters can be mapped with low computational costs because of their relatively small size. The actually time-consuming problem is the device selection or cluster distribution problem that has to be solved.

for all clusters i
    calculate area consumption (e.g. gate count)
    if area > a_max then pre-partitioning of the cluster i
    for all devices j
        if equivalent module exists in module library then
            take implementation costs c_ij from module library
        else
            calculate / estimate the costs a_ij, t_ij, p_ij
            calculate implementation costs of cluster i in device j:

            $$c_{ij} = w_t \cdot t_{ij} + w_a \cdot \frac{F_j \cdot a_{ij}}{K_j} + w_p \cdot p_{ij}$$

        end if
    end for
end for

where:
c_ij = costs to implement cluster i in device j,
w_t, w_a, w_p = weight factors to adjust costs for implementation and devices,
t_ij = maximal delay of cluster i in device j,
F_j = costs of device j,
a_ij = use of area of cluster i in device j,
K_j = total area of device j,
p_ij = preference factor for implementing cluster i in device j,
a_max = predefined upper bound for cluster sizes.

Figure 3: Cost Calculation / Estimation
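The cost calculation of Figure 3 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: the function names and the dictionary layout are assumptions, the pre-partitioning of clusters larger than a_max is omitted, and the example sets delay and preference to zero so that only the area term contributes.

```python
def implementation_cost(t_ij, a_ij, p_ij, F_j, K_j, w_t=1.0, w_a=1.0, w_p=1.0):
    """c_ij = w_t * t_ij + w_a * (F_j * a_ij / K_j) + w_p * p_ij."""
    return w_t * t_ij + w_a * (F_j * a_ij / K_j) + w_p * p_ij


def cost_matrix(clusters, devices, module_library=None, weights=(1.0, 1.0, 1.0)):
    """Return {(i, j): c_ij} for every cluster i and device j.

    clusters[i][j] = (t_ij, a_ij, p_ij): estimated delay, area, preference.
    devices[j]     = (F_j, K_j): device cost and total device area.
    module_library = optional {(i, j): c_ij} of known module costs.
    """
    module_library = module_library or {}
    w_t, w_a, w_p = weights
    costs = {}
    for i, per_device in clusters.items():
        for j, (F_j, K_j) in devices.items():
            if (i, j) in module_library:
                # An equivalent module exists: take its cost directly.
                costs[(i, j)] = module_library[(i, j)]
            else:
                t_ij, a_ij, p_ij = per_device[j]
                costs[(i, j)] = implementation_cost(
                    t_ij, a_ij, p_ij, F_j, K_j, w_t, w_a, w_p)
    return costs


# Device data from Table 3, cluster areas from Table 5; delay and
# preference are set to 0 here purely for illustration.
devices = {"XC2018": (1.5, 100), "A1020": (3.8, 547)}
clusters = {"alu_cmp_0": {"XC2018": (0.0, 16, 0.0), "A1020": (0.0, 62, 0.0)}}
costs = cost_matrix(clusters, devices)
print(costs)
```

With these inputs the area term charges each cluster the share of the device price that corresponds to the share of the device area it occupies.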

3.3 Linear Optimization of the Distribution Problem

The goal of this optimization step is an optimal selection and allocation of the clusters in a netlist to one or more suitable target devices. Figure 4 depicts the situation. The preselection step of Section 3.1 reduces the number of devices, and the partitioning or hierarchical division of the system description produces the clusters i. Each cluster, or subsets of these clusters, has to be assigned to suitable devices.

[Figure 4: Partitioning and Selection Problem. A netlist with clusters i is assigned, at costs c_ij, to devices j with costs F_j; the devices come from the reduced library, a subset of the device library.]

In order to give a mathematical description of this problem, we use a mixed integer programming formulation. Generally, the problem of distributing partitions to one or more target devices can be adapted from the facility (or plant, warehouse) location problem (FLP). There the goal is to minimize the costs of opening new warehouses and the costs of transporting goods from the warehouses to the customers. With given customer locations, the locations for the warehouses must be selected so as to minimize the transport costs. The solution to this problem gives the optimal locations for facilities to satisfy all customer demands. This problem formulation can be transferred to the device selection problem. In this case different circuit partitions have to be distributed to several devices under special cost criteria. Following the formulation in [1], the problem can be adapted and formulated as follows:

Minimize:

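$$Z = \sum_{i \in C} \sum_{j \in D} c_{ij} \cdot x_{ij} + w_d \cdot \sum_{j \in D} F_j \cdot y_j \qquad \text{(A)}$$

subject to

$$\sum_{i \in C} a_{ij} \cdot x_{ij} \le K_j \cdot A_j \cdot y_j, \quad \forall j \in D, \qquad \text{(B)}$$

$$\sum_{j \in D} x_{ij} = 1, \quad \forall i \in C, \qquad \text{(C)}$$

$$0 \le x_{ij} \le 1, \quad y_j \in \mathbb{N}_0, \quad \forall i \in C,\ j \in D, \qquad \text{(D)}$$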

where
c_ij = costs to implement cluster i in device j,
x_ij = portion of the area of cluster i in device j,
y_j = number of devices of type j,
F_j = costs of device j,
a_ij = use of area of cluster i in device j,
K_j = total area of device j,
A_j = proportion of K_j that should be used (0 ≤ A_j ≤ 1),
w_d = weight to adjust costs for implementation and devices,
C = set of clusters, C = {1,…,m},
D = set of devices, D = {1,…,n}.

This function can also be formulated as a mixed integer linear program like the capacitated facility location problem [1]. In this formulation, j indexes the devices or chips that are available for the realisation, and i indexes the clusters of the circuit to be implemented in the devices j. c_ij equals the costs for a cluster to be implemented in device j, while x_ij allows a portion of a cluster to be implemented (this produces the input to the partitioning tool). The product (F_j · y_j) adds the device cost to the formula. The constraints (B) and (C) are essential: (B) keeps the area allocated to a device type within its usable capacity, while (C) ensures that every cluster is allocated completely. Further constraints, such as limits on the device or I/O count, can be added without changing the complexity [4]. Constraint (D) bounds the proportions x_ij and restricts the device counts y_j to non-negative integers.

A solution to this minimization problem can be found in different ways. For smaller problems (e.g. n, m < 10) an exhaustive search can calculate all n^m valid solutions and find the optimum. Using an ILP solver based on the branch-and-bound method is not advisable because of the large number of decision variables in real problems. Therefore, for implementations with more partitions or possible target devices, the problem has to be split up and transferred into subproblems.
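The exhaustive search mentioned for small instances can be sketched as below. This is a simplified illustration under stated assumptions: each cluster is assigned as a whole to one device type (x_ij restricted to 0 or 1, which gives exactly n^m candidates), y_j is set to the smallest integer satisfying (B), and the function name and the instance data are made up for the example.

```python
import itertools
import math


def exhaustive_search(c, a, F, K, A, w_d=1.0):
    """Enumerate all n**m whole-cluster assignments.

    c[i][j], a[i][j]: cost and area of cluster i in device type j.
    F[j], K[j], A[j]: cost, total area and usable fraction of device type j.
    Returns (best_cost, assignment, y), where assignment[i] is the device
    type chosen for cluster i and y[j] the number of devices of type j.
    """
    m, n = len(c), len(F)
    best = (math.inf, None, None)
    for assign in itertools.product(range(n), repeat=m):
        load = [0.0] * n
        for i, j in enumerate(assign):
            load[j] += a[i][j]
        # Smallest y_j with load_j <= K_j * A_j * y_j, i.e. constraint (B).
        y = [math.ceil(load[j] / (K[j] * A[j]) - 1e-9) for j in range(n)]
        cost = (sum(c[i][j] for i, j in enumerate(assign))
                + w_d * sum(F[j] * y[j] for j in range(n)))
        if cost < best[0]:
            best = (cost, assign, y)
    return best


# Made-up instance: 3 clusters, 2 device types.
c = [[1.0, 2.0], [1.0, 2.0], [3.0, 1.0]]
a = [[16, 32], [16, 32], [98, 118]]
F, K, A = [1.5, 3.8], [100, 547], [1.0, 1.0]
best_cost, best_assign, best_y = exhaustive_search(c, a, F, K, A)
print(best_cost, best_assign, best_y)
```

The loop visits n^m assignments, which is why this only works for very small n and m and motivates the relaxation described next.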

3.4 The Lagrangian Approach with Subgradient Optimization

The high complexity of the minimization problem derives not only from the objective function itself, but also from the corresponding constraints. Various lower bounds on the function Z (see (A)) can be obtained by relaxing subsets of the constraints (B), (C) or (D), either completely or in a Lagrangian fashion [3,10]. The Lagrangian dual bound used here is obtained by relaxing constraint (C) and adding it to the objective function. This constraint, which demands the full allocation of all clusters to devices, is chosen for the relaxation because relaxing other constraints leads to higher complexity (for a proof see [4]). The Lagrangian multiplier vector λ ∈ R^m is attached to the objective function, and the result gives a lower bound on the optimal solution of the original problem.

Minimize:

$$\begin{aligned}
LR &= \sum_{i \in C} \sum_{j \in D} c_{ij} \cdot x_{ij} + w_d \cdot \sum_{j \in D} F_j \cdot y_j + \sum_{i \in C} \lambda_i \cdot \Bigl(1 - \sum_{j \in D} x_{ij}\Bigr) \\
   &= \sum_{j \in D} \sum_{i \in C} (c_{ij} - \lambda_i) \cdot x_{ij} + w_d \cdot \sum_{j \in D} F_j \cdot y_j + \sum_{i \in C} \lambda_i \qquad \text{(E)}
\end{aligned}$$

subject to

$$\sum_{i \in C} a_{ij} \cdot x_{ij} \le K_j \cdot A_j \cdot y_j, \quad \forall j \in D, \qquad \text{(F)}$$

$$x_{ij} \le 1, \quad \forall i \in C,\ j \in D, \qquad \text{(G)}$$

$$x_{ij} \ge 0, \quad y_j \in \mathbb{N}_0, \quad \forall i \in C,\ j \in D. \qquad \text{(H)}$$

In order to maximize the lower bounds, the following Lagrangian dual problem has to be solved:

$$\max_{\lambda \in \mathbb{R}^m} LR_{\min}(\lambda) = \max_{\lambda \in \mathbb{R}^m} \Bigl\{ \min_{x, y} \bigl\{ LR(x, y, \lambda) \bigr\} \Bigr\} \qquad \text{(I)}$$

This Lagrangian lower bound provides the basis for an efficient solution of the minimization problem. The best Lagrangian multipliers λ can be found by subgradient optimization [10], which can be applied because LR_min is continuous and concave. This is an iterative procedure which, starting from an initial set, systematically generates further multipliers in order to maximize the lower bound of the function. For each multiplier vector λ, the device vector y and the partitions x can be calculated. Because constraint (C) is relaxed, the problem can now be solved optimally and independently for each device by a greedy algorithm with the reduced complexity of O(m·n log m). This solution is not necessarily feasible with respect to the cluster partitions x, because constraint (C) is relaxed. Therefore a feasible solution x for the given devices y is found from function (A), which is now reduced to

Minimize:

$$Z_S = \sum_{i \in C} \sum_{j \in D_S} c_{ij} \cdot x_{ij} \qquad \text{(J)}$$

subject to

$$\sum_{i \in C} a_{ij} \cdot x_{ij} \le K_j \cdot A_j \cdot y_j, \quad \forall j \in D_S, \qquad \text{(K)}$$

$$\sum_{j \in D_S} x_{ij} = 1, \quad \forall i \in C, \qquad \text{(L)}$$

$$x_{ij} \ge 0, \quad \forall i \in C,\ j \in D_S, \qquad \text{(M)}$$

where D_S is the set of devices selected by y. This is a generalized transportation problem, which can be solved efficiently, for example by the network simplex method.
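The lower-bound computation can be illustrated with the following Python sketch. It is one possible reading of the procedure, not the authors' code: for a fixed λ the relaxed problem (E) to (H) is solved per device type by a greedy fractional knapsack for every candidate y_j, and λ is then moved along the subgradient 1 − Σ_j x_ij of the relaxed constraint (C). The diminishing step size, the initial multipliers, the function names and the instance data are assumptions, and the construction of a feasible solution via (J) to (M) is not shown.

```python
import math


def solve_relaxed(c, a, F, K, A, lam, w_d=1.0):
    """Minimize LR(x, y, lam) device by device; return (bound, x, y)."""
    m, n = len(c), len(F)
    x = [[0.0] * n for _ in range(m)]
    y = [0] * n
    bound = sum(lam)
    for j in range(n):
        cap = K[j] * A[j]
        # Only clusters with negative reduced cost can lower the objective;
        # sort them by reduced cost per unit of area (most negative first).
        items = sorted((i for i in range(m) if c[i][j] - lam[i] < 0),
                       key=lambda i: (c[i][j] - lam[i]) / a[i][j])
        y_max = math.ceil(sum(a[i][j] for i in items) / cap)
        best_val, best_y, best_x = 0.0, 0, {}
        for yj in range(1, y_max + 1):
            room, val, frac = cap * yj, w_d * F[j] * yj, {}
            for i in items:  # greedy fractional knapsack for this y_j
                take = min(1.0, room / a[i][j])
                if take <= 0:
                    break
                frac[i] = take
                val += (c[i][j] - lam[i]) * take
                room -= take * a[i][j]
            if val < best_val:
                best_val, best_y, best_x = val, yj, frac
        y[j] = best_y
        for i, take in best_x.items():
            x[i][j] = take
        bound += best_val
    return bound, x, y


def subgradient(c, a, F, K, A, w_d=1.0, iters=50, step0=1.0):
    """Return the best lower bound found for the dual problem (I)."""
    m = len(c)
    lam = [min(row) for row in c]  # initial multipliers (assumption)
    best = -math.inf
    for k in range(iters):
        bound, x, _ = solve_relaxed(c, a, F, K, A, lam, w_d)
        best = max(best, bound)
        g = [1.0 - sum(x[i]) for i in range(m)]  # violation of constraint (C)
        if all(abs(gi) < 1e-9 for gi in g):
            break
        step = step0 / (k + 1)  # diminishing step size (assumption)
        lam = [lam[i] + step * g[i] for i in range(m)]
    return best


# Made-up instance: 3 clusters, 2 device types.
c = [[1.0, 2.0], [1.0, 2.0], [3.0, 1.0]]
a = [[16, 32], [16, 32], [98, 118]]
F, K, A = [1.5, 3.8], [100, 547], [1.0, 1.0]
lower_bound = subgradient(c, a, F, K, A)
print(lower_bound)
```

Because (C) is an equality constraint, the multipliers are unrestricted in sign, and every value returned by `solve_relaxed` is a valid lower bound on Z regardless of how the step sizes are chosen.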

4 Experimental Results

We tested the device selection tool on several benchmark circuits. All variables are used as defined for formula (A). As a basis for the implementation, Table 3 shows a subset of the device library with the corresponding costs (F_j) and area capacities (K_j). The device costs are examples and are normalized to the smallest chip (XC2064). For the following examples the number of devices n equals 13.

Table 3: Subset of the Device Library D (n = 13)

Family | Architecture | Device # | Costs F_j | Area K_j
Xilinx | XC2000 | 2064 | 1 | 64
Xilinx | XC2000 | 2018 | 1.5 | 100
Xilinx | XC3000 | 3042 | 4.6 | 144
Xilinx | XC3000 | 3064 | 8.2 | 224
Xilinx | XC3000 | 3090 | 13.0 | 320
Xilinx | XC3000 | 3195 | 36.0 | 484
Xilinx | XC4000 | 4005 | 18.3 | 196
Xilinx | XC4000 | 4008 | 30.0 | 324
Xilinx | XC4000 | 4010 | 40.8 | 400
Actel | ACT 1 | 1010 | 2.9 | 295
Actel | ACT 1 | 1020 | 3.8 | 547
Actel | ACT 2 | 1225 | 4.6 | 451
Actel | ACT 2 | 1240 | 5.0 | 684

The first examples show device selection results for flat netlist descriptions (MCNC benchmarks). For these circuits a preclustering step [13] has to be performed in order to obtain a set C of clusters i, which can be distributed to different devices j of the library D. Preclustering is a commonly used partitioning step to reduce the complexity of a subsequent iterative optimization. Table 4 shows the number of gates and the number of clusters after the preclustering. The linear optimization of the cost function (A) then calculates the device vector y. Because of the high number of clusters, the cluster distribution vector x is not shown in Table 4; the following example gives exact numbers for these variables. The right columns give a comparison to the results of Kužnar et al. [8], who used the Xilinx 3000 FPGA family for the partitioning; the integration of different architectures and technologies was not possible there. With the new device selection approach presented here, cost-minimized implementations across different architectures could be achieved.

Table 4: MCNC-Benchmarks Implementations

Benchmark | # of Gates | Clusters m | Devices y_j (type x #) | Devices used by [8] (type x #)
C3540 | 1669 | 22 | A1240 x 1 | XC3042 x 3
C6288 | 2416 | 43 | A1020 x 1, XC3042 x 1 | -
s1494 | 653 | 37 | XC2064 x 1, A1010 x 1 | -
s1423 | 804 | 37 | XC2018 x 2 | -
s13207 | 7951 | 35 | A1240 x 5, XC2064 x 1, A1010 x 1 | XC3020 x 3, XC3030 x 5, XC3042 x 4
s15850 | 10369 | 38 | A1240 x 6 | -
s35932 | 17793 | 53 | A1010 x 1, A1240 x 14 | XC3030 x 5, XC3042 x 15, XC3064 x 4, XC3090 x 1

As a hierarchical example we used a VHDL description of the 16 bit microprocessor DP16 from the VHDL-Cookbook [2]. This description was synthesized with CADDY [5] and resulted in a hierarchical netlist description. This netlist included all functional units and their corresponding gate representation. The processor includes an ALU with adders and comparators, several buffers, registers and a control unit. The goal was a programmable FPGA implementation using the Actel and Xilinx families shown in Table 3. The optimization goal was the minimization of the total system implementation costs.

In a first step all clusters were mapped into the different technologies in order to calculate the costs for each cluster (see the left columns of Table 5). Afterwards the partitioning and selection problem could be solved. Columns 7 to 9 show the chosen devices (XC2018, A1020, A1240) for a given area utilisation of 100% (A_j = 1). The percentages represent the proportion of each cluster implemented in each device. The last row shows the percentage of area used on the chip (100% is the maximal chip capacity), which has to be reduced to 80% to ensure routability. The results of the second iteration, with the constraint of 80% area utilisation, are shown in the right columns. The three chosen XC2018 devices allow the corresponding clusters to be mapped into this architecture and subsequently partitioned. The same can be calculated for the Actel devices. This simple example shows that, out of 13 different Actel and Xilinx devices, three (five) cost minimizing chips could be found for this implementation. The computation time for all examples of Tables 4 and 5, including several changes of user constraints, was below 15 seconds.

Table 5: Implementation of the DP16 Processor

Columns 2 to 6: # of logic blocks used for each device family (c_ij). Columns 7 to 9: result, proportion x_ij of cluster i in device j for 100% area utilisation (A_j = 1). Columns 10 and 11: result, proportion x_ij of cluster i in device j for 80% area utilisation (A_j = 0.8).

Cluster i | XC2000 | XC3000 | XC4000 | Act 1 | Act 2 | 1 x XC2018 | 1 x A1020 | 1 x A1240 | 3 x XC2018 | 2 x A1240
alu_add_0 | 16 | 16 | 12 | 32 | 31 | 0% | 100% | 0% | 0% | 100%
alu_addsub_0 | 32 | 32 | 24 | 65 | 65 | 0% | 100% | 0% | 100% | 0%
alu_addsub_1 | 32 | 32 | 24 | 65 | 65 | 0% | 100% | 0% | 0% | 100%
alu_cmp_0 | 16 | 16 | 12 | 62 | 32 | 100% | 0% | 0% | 100% | 0%
alu_cmp_1 | 16 | 16 | 12 | 62 | 32 | 100% | 0% | 0% | 100% | 0%
alu_cmp_2 | 16 | 16 | 12 | 62 | 32 | 100% | 0% | 0% | 100% | 0%
alu_cmp_3 | 16 | 16 | 12 | 62 | 32 | 100% | 0% | 0% | 100% | 0%
alu_sub_0 | 16 | 16 | 12 | 34 | 21 | 0% | 0% | 100% | 0% | 100%
buffer16_0 | 16 | 16 | 8 | 32 | 32 | 0% | 100% | 0% | 100% | 0%
buffer16_1 | 16 | 16 | 8 | 32 | 32 | 0% | 100% | 0% | 100% | 0%
cc_comp | 2 | 2 | 2 | 4 | 4 | 100% | 0% | 0% | 0% | 100%
control | 98 | 82 | 62 | 118 | 108 | 0% | 100% | 0% | 0% | 100%
latch16_0 | 16 | 16 | 8 | 32 | 32 | 26% | 74% | 0% | 0% | 100%
latch16_1 | 16 | 16 | 8 | 32 | 32 | 100% | 0% | 0% | 100% | 0%
latch16_2 | 16 | 16 | 8 | 32 | 32 | 0% | 100% | 0% | 0% | 100%
latch3 | 3 | 3 | 2 | 6 | 6 | 100% | 0% | 0% | 0% | 100%
latch_buf16 | 48 | 32 | 32 | 81 | 65 | 0% | 83% | 17% | 0% | 100%
mux2 | 4 | 4 | 2 | 8 | 8 | 100% | 0% | 0% | 100% | 0%
pc_reg | 71 | 60 | 48 | 80 | 80 | 0% | 100% | 0% | 0% | 100%
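As a small worked check of constraint (B) on the 100% run, the sketch below sums the area each chosen device receives. It assumes that the logic block counts in the left columns of Table 5 can be read as the areas a_ij and that they are in the same unit as the capacities K_j of Table 3; the variable and function names are illustrative only.

```python
# Per cluster: logic blocks in XC2000, ACT 1 and ACT 2 (Table 5, left
# columns) and the proportions x_ij for A_j = 1 in XC2018, A1020, A1240.
ROWS = {
    "alu_add_0":    (16, 32, 31, 0.00, 1.00, 0.00),
    "alu_addsub_0": (32, 65, 65, 0.00, 1.00, 0.00),
    "alu_addsub_1": (32, 65, 65, 0.00, 1.00, 0.00),
    "alu_cmp_0":    (16, 62, 32, 1.00, 0.00, 0.00),
    "alu_cmp_1":    (16, 62, 32, 1.00, 0.00, 0.00),
    "alu_cmp_2":    (16, 62, 32, 1.00, 0.00, 0.00),
    "alu_cmp_3":    (16, 62, 32, 1.00, 0.00, 0.00),
    "alu_sub_0":    (16, 34, 21, 0.00, 0.00, 1.00),
    "buffer16_0":   (16, 32, 32, 0.00, 1.00, 0.00),
    "buffer16_1":   (16, 32, 32, 0.00, 1.00, 0.00),
    "cc_comp":      (2, 4, 4, 1.00, 0.00, 0.00),
    "control":      (98, 118, 108, 0.00, 1.00, 0.00),
    "latch16_0":    (16, 32, 32, 0.26, 0.74, 0.00),
    "latch16_1":    (16, 32, 32, 1.00, 0.00, 0.00),
    "latch16_2":    (16, 32, 32, 0.00, 1.00, 0.00),
    "latch3":       (3, 6, 6, 1.00, 0.00, 0.00),
    "latch_buf16":  (48, 81, 65, 0.00, 0.83, 0.17),
    "mux2":         (4, 8, 8, 1.00, 0.00, 0.00),
    "pc_reg":       (71, 80, 80, 0.00, 1.00, 0.00),
}
CAPACITY = {"XC2018": 100, "A1020": 547, "A1240": 684}  # K_j from Table 3


def device_loads(rows):
    """Sum a_ij * x_ij per chosen device (left-hand side of constraint (B))."""
    loads = {"XC2018": 0.0, "A1020": 0.0, "A1240": 0.0}
    for xc2000, act1, act2, x_xc, x_a1020, x_a1240 in rows.values():
        loads["XC2018"] += xc2000 * x_xc
        loads["A1020"] += act1 * x_a1020
        loads["A1240"] += act2 * x_a1240
    return loads


loads = device_loads(ROWS)
for name, load in loads.items():
    print(name, round(load, 2), "of", CAPACITY[name])
```

Under these assumptions the A1020 comes out almost completely filled (about 546.9 of 547 blocks), which is consistent with the remark that the allowed utilisation has to be lowered to keep the devices routable.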

5 Summary and Conclusions

This paper presented a new approach to the problem of selecting suitable target technologies and devices for complex circuits. Owing to the evaluation steps, the approach is able to handle very large circuits, and the cost function allows different user constraints to be considered during the optimization. The partitioning and device selection step could be formulated as a general facility location problem, which has to be minimized. By Lagrangian relaxation the constraints of the function could be transferred into a dual lower bound function, which could be solved efficiently. The examples showed the optimization of actual implementation problems, and good results could be found for different user constraints. Future work will concentrate on extending the device/module library and adding interfaces to further synthesis tools.

6 References

[1] C.H. Aikens, “Facility location models for distribution planning”, European Journal of Operational Research 22, pp. 263-279, 1985.

[2] P. Ashenden, “The VHDL-Cookbook”, Department of Computer Science, University of Adelaide, 1990.

[3] J. Beasley, “Lagrangean heuristics for location problems”, European Journal of Operational Research 65, pp. 383-399, 1993.

[4] G. Cornuejols, R. Sridharan and J. Thizy, “A comparison of heuristics and relaxations for the Capacitated Plant Location Problem”, European Journal of Operational Research 50, pp. 280-297, 1991.

[5] P. Gutberlet, W. Rosenstiel, “Timing Preserving Interface transformations for the Synthesis of behavioural VHDL”, Proceedings of the EURO-DAC 94, Grenoble, 1994.

[6] B.W. Kernighan, S. Lin, “An Efficient Heuristic Procedure for Partitioning Graphs”, Bell System Technical Journal, vol. 49, no. 2, 1970.

[7] S. Kirkpatrick, C. Gelatt and M. Vecchi, “Optimization by Simulated Annealing”, Science, vol. 220, no. 4598, pp. 671-680, 1983.

[8] R. Kužnar, F. Brglez and K. Kozminski, “Cost Minimization of Partitions into Multiple Devices”, Proceedings of DAC 93, pp. 315-320, 1993.

[9] H. Owen, U. Khan, J. Hughes, “FPGA-based Emulator Architectures”, Int. Workshop on Field Programmable Logic and Applications, Oxford, Sept. 1993.

[10] C. Reeves, “Modern Heuristic Techniques for Combinatorial Problems”, Blackwell Scientific Publications, Oxford, 1993, pp. 266-276.

[11] U. Weinmann, W. Rosenstiel, “Technology Mapping for Sequential Circuits based on Retiming Techniques”, Proceedings of the EURO-DAC 93, Hamburg, 1993.

[12] U. Weinmann, W. Rosenstiel, “Universal Technology Mapping for Table-Lookup FPGAs”, IFIP Workshop on Logic and Architecture Synthesis, Grenoble, 1993.

[13] U. Weinmann, W. Rosenstiel, “Network Flow based Clustering and Partitioning for FPGAs”, IFIP Workshop on Logic and Architecture Synthesis, Grenoble, Dec. 1994.
