An Algorithm for Partitioning of Application Specific Systems

Zebo Peng (E-mail: [email protected])
Krzysztof Kuchcinski (E-mail: [email protected])
Abstract

This paper presents a simulated-annealing based algorithm to partition an application specific system into a set of modules. The role of partitioning is to discover the structure implicit in the functional specification of the system so as to guide high-level synthesis decisions in a design environment for digital systems consisting of hardware parts and possibly software components. The partitioning algorithm can also be used to partition the final or intermediate results of a high-level synthesis process into several physical blocks. Experimental results show that our approach produces better register-transfer designs with less global communication.

This work has been supported by the Swedish National Board for Industrial and Technical Development (NUTEK).

Published in Proceedings of the European Conference on Design Automation, EDAC'93, Paris, France, February 22-25, 1993.
IDA Technical Report 1994 LiTH-IDA-R-94-01 ISSN-0281-4250
Department of Computer and Information Science, Linköping University, S-581 83 Linköping, Sweden
1. Introduction

We are developing a design methodology which allows integration of the software and hardware design activities. In particular, we are designing and implementing a silicon/software compilation environment that allows system designers to start the design process with a high-level behavioral specification and to generate a register-transfer level hardware implementation as well as a software program which together implement the given specification. Our approach is based on a Petri net based design representation which captures design results in the intermediate design stages so that hardware/software trade-offs can be easily made and evaluated.

In this paper, we present a multi-way partitioning algorithm to divide the functionality of a system into a set of modules, each of which corresponds to a physical unit such as a chip or a software package. The role of partitioning is to discover the structure implicit in the functional specification of the system so as to guide other synthesis decisions. The partitioning process is formulated as a graph partitioning problem. We have developed a set of heuristics to convert our Petri net based intermediate design representation into a single graph notation. A simulated-annealing [8] based algorithm has been implemented to partition this graph into a set of sub-graphs. The partitioning results are then mapped back into a set of design representations with well-defined interfaces. Each design representation can then be compiled into register-transfer level hardware by a high-level synthesis tool or converted into a software procedure.

Several architectural level partitioning algorithms are described in the literature [11], [3], [10]. McFarland clusters the functions of an ISPS description mainly on the basis of their similarity, measured by the amount of common functionality and the amount of communication between them [11]. Camposano also uses a clustering technique to partition a digital design into several blocks before logic synthesis is performed on the individual blocks. His algorithm takes into account the similarity of the functions and common data carriers, and contributes to reducing the time spent in logic synthesis as well as improving the synthesis results [3]. Our partitioning algorithm takes into account not only the data path design but also the controller structure. This is done by mapping the control flow and data flow notation of our design representation into a single partitioning graph and introducing different arcs between nodes to reflect the closeness between them in terms of different design aspects. In this sense, our approach is similar to the work of Lagnese and Thomas [10], which uses a multistage clustering technique to partition behavioral descriptions taking into account different closeness criteria. However, Lagnese's approach deals with one partitioning criterion at a time at each of the clustering stages and is only applied to partition input
behavioral specifications. Our partitioning algorithm is defined on the intermediate design representation and can, therefore, be applied during any step of the architectural, high-level, and register-transfer level synthesis processes. For example, the partitioning algorithm can first be used to partition a design into a number of modules so that the data path allocation algorithm can bind operations into the same physical module according to the partitioning information. At a later stage it can be used to partition the detailed design so as to guide the floor-planning procedure.

Another main feature of our approach is that it also addresses the problem of hardware/software partitioning. This is achieved by taking into account the dynamic profile information of a design and clustering operations with similar utilization frequencies together. The dynamic information of a design is collected by simulating the design representation in our design environment. To give a precise evaluation of the hardware/software partitioning results, however, a target execution system for the software must be specified, which is outside the scope of this paper. We therefore make the simple assumption here that the software will run on a given microprocessor.

This paper first gives an overview of the design environment we are developing and then concentrates on the hardware partitioning problem. The detailed formulation of the partitioning procedure is presented in section 3 and the partitioning algorithm is described in section 4. Section 5 summarizes the preliminary experimental results of using the partitioning algorithm. It shows that, using the partitioning algorithm, improved register-transfer designs are produced; in particular, the number of global routing wires is generally reduced.
2. Overview of the Design Environment

The overall structure of the proposed design environment is illustrated in Figure 1. We briefly discuss each part in this section; for more details, see [13].

The starting point of our design process is a set of design constraints together with a behavioral specification in the form of a program written in a high-level language. The input specification is given in a PASCAL-like language called ADDL (Algorithmic Design Description Language) [6]. ADDL consists of a subset of PASCAL with several extensions to capture parallelism and hardware-specific operations. The compiler front-end performs an extensive analysis of the ADDL program and generates as output an initial design representation in the Petri net based notation. The input can also be specified in VHDL, which is likewise compiled into the Petri net based representation [5].

We have chosen the Petri net notation because it allows us to describe the partial ordering relation over a set of places and transitions [14]. The places are used in our approach to represent operations, and the transitions represent the synchronization of the system.
[Figure 1. Overview of the proposed design environment. Blocks: Behavioral Specification, Design Constraints, Stimuli Input, Compiler Front-end, Intermediate Representation, Simulator / Performance Statistics, Transformations/Optimization, Partitioning, Hardware Synthesizer, Software Compiler, Hardware Structures, Software Components.]
If two operations are not in the partial ordering relation, they are causally independent and may occur in either order or simultaneously. Therefore, the partial ordering of operations can be used to express the concurrent and asynchronous aspects of a hardware/software system. The Petri net notation is extended to include a data flow description of the basic operations. The data flow description specifies the operation to be performed, the set of variables which are used by the operation, and the set of variables which are assigned by the operation. These operations are performed when their corresponding place holds a token [13].

An example of the Petri net based notation is illustrated in Figures 2(a) and 2(b). Figure 2(a) depicts a Petri net which describes the control flow. A control state is defined as a marking of the Petri net, i.e., the possession of tokens by a subset of the places of the Petri net, which are depicted as circles. Transitions between control states are represented as firings of one or several transitions of the Petri net, which are depicted as bars. Different from ordinary Petri nets, the transitions can be guarded by conditions which must be true before the transitions can fire. The guarding conditions are generated by data path operations. Figure 2(b) depicts the data flow graph of the given example.

The Petri net based representation is used to capture the intermediate results of design transformations/optimization. This representation captures parallel computations explicitly and allows
hardware/software partitioning to be done in different ways. It is also formal and executable, which allows the designer to use verification and evaluation techniques to analyze the intermediate design and make appropriate design trade-offs. One of the evaluation techniques we have implemented is a simulator which executes the design representation with typical input data and collects statistics about operation utilization and control flow choices in a given design.

The partitioning algorithm uses the user-given design constraints, the intermediate representation, and the performance statistics to guide the partitioning of a design into software and hardware sub-systems. After the partitioning is done, the hardware implementation is generated by CAMAD [12], a high-level silicon compiler. The hardware implementation is considered a co-processor which interacts with the software generated by a compiler and running on a given microprocessor. The design environment is targeted towards the design of application specific systems, such as real-time controllers and communication protocols, which have tightly coupled hardware and software components.
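To make the representation concrete, the following Python sketch shows one possible encoding of the extended Petri net described above: places carry a data flow operation together with the sets of variables it uses and assigns, and transitions may be guarded by a condition produced in the data path. This is only an illustration under our own naming; the class and field names are hypothetical and are not taken from CAMAD or ADDL.

    from dataclasses import dataclass, field
    from typing import Optional, Set, List, Dict

    @dataclass
    class Operation:
        """Data flow description attached to a place."""
        op: str                                          # e.g. "+", ">=", "IN"
        uses: Set[str] = field(default_factory=set)      # variables read by the operation
        defines: Set[str] = field(default_factory=set)   # variables assigned by the operation

    @dataclass
    class Place:
        name: str                                        # e.g. "P1"
        operation: Optional[Operation] = None
        utilization: float = 0.0                         # frequency collected by the simulator

    @dataclass
    class Transition:
        inputs: List[str]                                # names of input places
        outputs: List[str]                               # names of output places
        guard: Optional[str] = None                      # guarding condition from the data path
        firing_rate: float = 1.0                         # firings per activation (from simulation)

    @dataclass
    class ExtendedPetriNet:
        places: Dict[str, Place]
        transitions: List[Transition]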
3. The Partitioning Procedure

The partitioning algorithm makes heavy use of the simulation results, especially the critical regions (operations with high utilization frequencies) identified by the simulator, so as to "allocate" hardware for the most performance-critical parts of a design and leave the rest for a less expensive software implementation. It also takes into account the connectivity of the system components as well as the impact of communication frequency and delay on the system performance when making decisions about design partitioning. To make it possible for designers to emphasize certain aspects over others, different types of arcs are introduced when converting the design representation into a graph.

The basic idea is to first construct a graph with nodes and edges and then use a simulated-annealing graph partitioning algorithm to partition it into a set of sub-graphs. In traditional hardware structural partitioning, a node of the graph represents a basic hardware component, with its weight being the cost of the component. An edge between two nodes represents the connection of the two components, with its weight being the cost of implementing the connection. A partitioning algorithm decomposes the graph into a set of sub-graphs so as to minimize the sum of the costs on all cut edges, while trying to balance the total weights of the sub-graphs. The main difference in our situation is that each operation also carries a measure of utilization frequency, which represents the percentage of time during which this operation is executed. Nodes with similar utilization frequencies should be clustered into the same sub-graph so that the deviation of utilization frequencies within each partition is small. This allows the designers to select a software implementation for sub-graphs with small average utilization
frequencies and a hardware implementation for the others.

In our approach, the graph is constructed to have the same topological structure as the Petri net which captures the control structure of the design, i.e., each place of the Petri net is mapped into a node. The weight on a node represents the estimated implementation cost of its related operation part (captured as a data flow description). Figure 2(c) depicts the partitioning graph generated from the given example, with node Ni corresponding to place Pi in the Petri net.

A C-edge connects two place nodes if there is a direct control flow relation between the two places, i.e., one of them is input to a transition and the other is output of the same transition. For example, in Figure 2(c), N1 is connected to N2 since P1 is input to a transition while P2 is output of the same transition. The weight of a C-edge equals the firing rate of the transition, i.e., the number of times the transition is fired during a single execution (activation) of the given design. The C-edge captures a similar idea to the control flow clustering of Lagnese's approach [10].
[Figure 2. An example of the design representation and its generated partitioning graph: (a) control Petri net with places P1 to P6; (b) dataflow graph; (c) the generated partitioning graph with nodes N1 to N6. All C-edges are included, but only examples of D-edges and U-edges are drawn to keep the figure simple.]
However, our algorithm can deal with the cyclic control flow resulting from loops and is thus more powerful than Lagnese's method. For example, the C-edge between N2 and N3 in Figure 2(c) has 10 as its weight because, for each activation of the design, the transition between P2 and P3 will be fired 10 times due to the iteration of the loop consisting of P2, P3, P4 and P5 (the estimated figures in this example depend on the typical input data given to the simulator). The C-edge connecting N1 and N2, on the other hand, has 1 as its weight because the transition connecting them is fired only once.

If the operations associated with two places are related to each other, for example if they operate on the same data, a D-edge connects their corresponding nodes in the constructed graph. The weight of the D-edge is directly proportional to the degree of relation between the two operations. For example, since both P3 and P4 involve the variable X, a D-edge is used to connect N3 and N4 as shown in Figure 2(c), and its weight equals 16, which is the wordlength of the register used to store X.

If there is either a C-edge or a D-edge between two place nodes in the constructed graph, a U-edge is also introduced between the two nodes. The weight of the U-edge is inversely proportional to the difference between the utilization frequencies of the corresponding places plus 1. In this way, if two nodes are related either by control transfer or data transfer and have very similar utilization frequencies, there will be an edge between them with a relatively large weight, and the partitioning algorithm will therefore cluster them into the same partition with high probability. For example, in the design given in Figure 2, the utilization frequencies (which can also be generated by the simulator) are 1 for P1, 11 for P2, and 10 for P5; the U-edge between P2 and P5 will therefore have the weight 1/(11-10+1) = 0.5, while the U-edge between P1 and P2 has the weight 1/(11-1+1) = 0.09.

The relative weights of the C-edges (wC), D-edges (wD) and U-edges (wU) are regulated by three different weight-multipliers (MC, MD, and MU) which can be controlled by the designers. Before the partitioning algorithm is applied to the graph, all the different types of edges between two nodes are combined into one, and the weight of the combined edge between two nodes Ni and Nj is computed by the following formula:

    W(i,j) = wC(i,j) x MC + wD(i,j) x MD + wU(i,j) x MU
The designers can, for example, increase the U-edge weight-multiplier, MU, to stress the importance of clustering nodes with similar utilization frequencies. They can, in another situation, set the U-edge weight-multiplier to zero so that the utilization frequencies play no role in the partitioning process (which is useful when the algorithm is used to partition a hardware system into physical modules). The designers can also introduce other edges to reflect special design constraints.
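As a concrete illustration of the weighting scheme, the following minimal sketch (our own naming, not the authors' implementation) computes U-edge weights from utilization frequencies and combines the three edge types according to the formula above:

    def u_edge_weight(util_i, util_j):
        """U-edge weight: inversely proportional to the utilization-frequency
        difference between the two places, plus 1."""
        return 1.0 / (abs(util_i - util_j) + 1)

    def combined_weight(wC, wD, wU, MC=1.0, MD=1.0, MU=1.0):
        """Combined edge weight: W(i,j) = wC(i,j)*MC + wD(i,j)*MD + wU(i,j)*MU."""
        return wC * MC + wD * MD + wU * MU

    # The two U-edge examples from the text (utilizations: P1 = 1, P2 = 11, P5 = 10):
    print(u_edge_weight(11, 10))   # P2-P5: 1/(11-10+1) = 0.5
    print(u_edge_weight(11, 1))    # P1-P2: 1/(11-1+1) = 0.09 (approximately)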
4. The Simulated-Annealing Algorithm

Since the graph partitioning problem with specified bounds on the sizes of the resulting sub-graphs is NP-complete [7], no efficient algorithm is likely to exist for finding optimal solutions for large graphs. Heuristics must therefore be used, and we have selected a simulated-annealing [8] based approach to partition the final weighted graph. This approach has been selected because it avoids the problem of getting stuck at a local optimum.

The simulated-annealing algorithm uses an iterative improvement approach with random acceptance to carry out the partitioning task. It starts with an arbitrary initial configuration which is then subjected to rearrangements of its elements. A rearrangement is accepted whenever it improves the cost function. When a worse result is obtained, the rearrangement is accepted randomly. The whole process is controlled by a temperature parameter which, in principle, allows more random acceptances at higher temperatures than at lower ones. The temperature is set to a high value at the beginning and is reduced slowly towards zero. The selection of temperatures and the length of time spent at each temperature are two important parameters of the optimization process. Our implementation allows the designers to specify these two parameters to override the default values.

Rearrangements of the graph are done by randomly selecting either the move of one node to another partition or the swap of two nodes. The cost function is defined as the sum of the weights assigned to cut arcs, which is to be minimized under the constraint that the sums of the weights assigned to the nodes in the partitions do not differ by more than a certain value (the value can be given as a parameter to the algorithm). The simulated-annealing algorithm can be illustrated by the following pseudo-code:

    Temp := Start_temperature;
    Conf := Initial_Configuration;
    while (cost changes) OR (Temp > Final_temperature) do
        for a number of times do
            generate a new configuration Conf';
            if accept(Cost(Conf'), Cost(Conf), Temp) then Conf := Conf';
        end for;
        Temp := Temp * Reduction_factor;
    end while;
with the acceptance function defined as:

    function accept(New_cost, Old_cost, Temp)
        Cost_change := New_cost - Old_cost;
        if Cost_change < 0 then
            accept := TRUE;              (* accept based on cost improvement *)
        else
            Y := exp(-Cost_change/Temp);
            R := random(0, 1);
            if (R < Y) then
                accept := TRUE;          (* accept randomly *)
            else
                accept := FALSE;         (* do not accept *)
    end function;
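For concreteness, a small runnable Python sketch of the partitioning loop described above is given below. It is our own simplified reconstruction rather than the authors' implementation: the graph is given as a matrix of combined edge weights, rearrangements are random moves and swaps, the cost is the sum of the weights on cut edges, the size-balance constraint rejects configurations whose partition weights differ by more than a given bound, and the stopping criterion is simplified to a final temperature. All names are hypothetical.

    import math
    import random

    def cut_cost(weights, assign):
        """Sum of combined edge weights whose endpoints lie in different partitions."""
        n = len(weights)
        return sum(weights[i][j]
                   for i in range(n) for j in range(i + 1, n)
                   if weights[i][j] and assign[i] != assign[j])

    def balanced(node_cost, assign, n_parts, max_diff):
        """Check that the total node weights of the partitions differ by at most max_diff."""
        totals = [0.0] * n_parts
        for v, p in zip(node_cost, assign):
            totals[p] += v
        return max(totals) - min(totals) <= max_diff

    def accept(new_cost, old_cost, temp):
        """Always accept improvements; accept deteriorations with probability exp(-delta/temp)."""
        delta = new_cost - old_cost
        if delta < 0:
            return True
        return random.random() < math.exp(-delta / temp)

    def partition(weights, node_cost, n_parts, max_diff,
                  start_temp=400.0, final_temp=0.1,
                  reduction=0.985, iters_per_temp=500):
        """Partition the graph into n_parts sub-graphs, minimizing the cut cost."""
        n = len(weights)
        assign = [i % n_parts for i in range(n)]   # roughly balanced initial configuration
        cost = cut_cost(weights, assign)
        temp = start_temp
        while temp > final_temp:
            for _ in range(iters_per_temp):
                new_assign = list(assign)
                if random.random() < 0.5:           # move one node to another partition
                    v = random.randrange(n)
                    new_assign[v] = random.randrange(n_parts)
                else:                               # swap the partitions of two nodes
                    a, b = random.sample(range(n), 2)
                    new_assign[a], new_assign[b] = new_assign[b], new_assign[a]
                if not balanced(node_cost, new_assign, n_parts, max_diff):
                    continue
                new_cost = cut_cost(weights, new_assign)
                if accept(new_cost, cost, temp):
                    assign, cost = new_assign, new_cost
            temp *= reduction
        return assign, cost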
The use of the simulated-annealing optimization algorithm for graph partitioning gives us the freedom to tune the algorithm for different applications by changing some of its parameters, such as the number of partitions, the starting temperature, the temperature reduction factor, etc.
5. Experimental Results

We have implemented the simulated-annealing based partitioning algorithm and run several designs to test the partitioning procedure. The results of our experiments are summarized in Table 1. All results are produced with the following parameters: 1) the starting temperature is set to 400; 2) the temperature reduction factor is set to 0.985; 3) the iteration number for each temperature is set to 500; and 4) the maximal allowed size difference between different partitions is 25% of the average size of the partitions.

Table 1: Summary of Partitioning Results

    Design                | No. of Places | No. of Nodes | No. of Partitions | No. of Cut Ctr Seq | No. of Cut Data Nodes | CPU Time (s) | Remarks
    Elliptical filter     | 58            | 76           | 2                 | 4                  | 2                     | 354          | Find minimal cut of registers
    FRISC microprocessor  | 63            | 142          | 2                 | 12                 | 6                     | 142          | Instructions are grouped according to their data dependency and hardware sharing properties
                          |               |              | 3                 | 13                 | 8                     | 231          |
                          |               |              | 4                 | 21                 | 10                    | 307          |
    Square root algorithm | 37            | 69           | 2                 | 6                  | 3                     | 13.7         | Highly utilized parts are grouped together
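To relate these settings to the sketch in section 4, the hypothetical partition() function shown there would be called roughly as follows; the toy graph below (its combined edge weights and node costs) is made up purely for illustration and is not one of the benchmark designs.

    # Toy 4-node partitioning graph with combined edge weights.
    weights = [[0,   10,  0,   0],
               [10,  0,   16,  0.5],
               [0,   16,  0,   10],
               [0,   0.5, 10,  0]]
    node_cost = [1.0, 1.0, 1.0, 1.0]

    n_parts = 2
    max_diff = 0.25 * sum(node_cost) / n_parts   # 25% of the average partition size
    assign, cost = partition(weights, node_cost, n_parts, max_diff,
                             start_temp=400.0, reduction=0.985, iters_per_temp=500)
    print(assign, cost)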
Since the simulated-annealing algorithm is inherently random, each run of the same design with the given parameters usually gives a slightly different result. We have used the simple rule of running the algorithm three times and choosing the best result. The CPU times listed in the
table are given as CPU seconds on a Sun SPARCstation ELC. The experimental results presented here are preliminary, since we have not yet had time to perform extensive tests with different combinations of parameters. We expect better results when more testing has been performed and a more clearly defined method for choosing the parameters has been developed. We will also work on improving the efficiency of the algorithm in the near future.

The first example we have run is a fifth-order digital elliptic wave filter from [4]. It is one of the benchmark examples from the 1988 ACM/IEEE workshop on high-level synthesis [2]. The final partitioning result with respect to the extended Petri net data path is shown in Figure 3.
[Figure 3. Partitioned Elliptical Filter's Data Path. This figure is not available online; please contact the authors for a paper copy.]
In Figure 3, the shaded nodes form one partition and the rest form the other. Only two registers (op12 and op19) need to be shared between the two partitions, which is the best case for this example when the impact of operation scheduling is not taken into account.

The second example is the FRISC microprocessor, which has been used by several synthesis systems [1]. We have performed several partitionings to illustrate the multi-way partitioning feature of our algorithm, namely that it can be used to partition a system into different numbers of clusters. The results show that the instructions of the microprocessor are grouped into natural clusters according to their data dependencies and their sharing of registers and hardware units.

We have used the third example, the square root algorithm implementation, to demonstrate the hardware/software partitioning feature. The square root algorithm uses only addition, subtraction, and comparison operations to calculate the square root of an integer. The original algorithm is presented in [9] and it has been used as a test case for several high-level synthesis systems (see, for example, [15]). Acknowledging that it is not an ideal test case for hardware/software systems, we selected this example because it consists of both a highly utilized part (inside loops) and a part which is seldom used. Our algorithm has successfully partitioned the design into two clusters, one consisting of the highly utilized parts of the design and the other of only the rarely used parts. In this way, a hardware/software boundary can easily be identified. For the elliptical filter, on the other hand, the repetition counts of the operations play no role at all in the partitioning procedure, since all of the operations have the same repetition count.

From these examples, it can be seen that our algorithm can deal with different types of designs, from microprocessors to digital signal processing systems to special algorithmic examples. At the moment, we use the same parameters and the same edge weight-multipliers for all types of designs when testing our partitioning procedure. Future work in this project includes developing heuristics to automatically select the parameters for the simulated-annealing algorithm and the weight-multipliers for the partitioning graph.
6. Conclusions

We have presented a method to partition a digital system into several clusters so that the cost in terms of communication/synchronization between the clusters is minimized. The partitioning results are then used to guide the high-level synthesis process so that global communications are reduced.

Our approach differs from most other architectural level partitioning algorithms in that it takes several aspects of the design into account simultaneously. This is done by mapping the control flow graph and the data flow graph onto a single partitioning graph and introducing different arcs
between nodes in the partitioning graph to reflect the closeness between them in terms of different design criteria. The second feature of our approach is that the partitioning algorithm is defined on the intermediate design representation and can, therefore, be applied during any step of the architectural and high-level synthesis processes. The third feature of our approach is that it addresses the problem of hardware/software partitioning. This is achieved by taking into account the dynamic profile information of a design and clustering operations with similar utilization frequencies together. An implementation has been presented together with preliminary experimental results, which show that our algorithm can produce improved register-transfer level designs with less global communication.
7. References

[1] Berstis, V., Brand, D. and Nair, R., An Experiment in Silicon Compilation, Proc. International Symposium on Circuits and Systems, 1985, pp. 655-658.
[2] Borriello, G. and Detjens, E., High Level Synthesis: Current Status and Future Directions, Proc. 25th Design Automation Conf., June 1988, pp. 477-482.
[3] Camposano, R. and Brayton, R. K., Partitioning Before Logic Synthesis, Proc. ICCAD'87, Nov. 1987, pp. 324-326.
[4] Dewilde, P., Deprettere, E. and Nouta, R., Parallel and Pipelined VLSI Implementation of Signal Processing Algorithms, in S. Y. Kung, H. J. Whitehouse and T. Kailath, eds., VLSI and Modern Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1985, pp. 257-264.
[5] Eles, P., Kuchcinski, K., Peng, Z. and Minea, M., Compiling VHDL into a High-Level Synthesis Design Representation, Proc. 1st European Design Automation Conf. (EURO-DAC), 1992, pp. 604-609.
[6] Fjellborg, B., An Approach to Extraction of Pipeline Structures for VLSI High-Level Synthesis, Licentiate thesis No. 212, Dept. of Computer and Information Science, Linköping University, 1990.
[7] Garey, M. R. and Johnson, D. S., Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Company, San Francisco, 1979.
[8] Kirkpatrick, S., Gelatt, C. D. Jr. and Vecchi, M. P., Optimization by Simulated Annealing, Science, Vol. 220, No. 4598, May 1983.
[9] Knuth, D., Metafont, Volume 2 of Computers and Typesetting, Addison-Wesley, Reading, Massachusetts, 1985.
[10] Lagnese, E. D. and Thomas, D. E., Architectural Partitioning for System Level Synthesis of Integrated Circuits, IEEE Trans. CAD, Vol. 10, No. 7, 1991, pp. 847-860.
[11] McFarland, M. C., Computer-Aided Partitioning of Behavioral Hardware Descriptions, Proc. 20th Design Automation Conf., June 1983, pp. 472-478.
[12] Peng, Z., Kuchcinski, K. and Lyles, B., CAMAD: A Unified Data Path/Control Synthesis Environment, in D. A. Edwards, ed., Design Methodologies for VLSI and Computer Architecture, North-Holland, 1988, pp. 53-67.
[13] Peng, Z., Fagerstrom, J. and Kuchcinski, K., A Unified Approach to Evaluation and Design of Hardware/Software Systems, Proc. ICSE-13 (13th International Conference on Software Engineering) Workshop on Software/Hardware Codesign, Austin, Texas, May 1991.
[14] Peterson, J. L., Petri Net Theory and the Modeling of Systems, Prentice-Hall, Englewood Cliffs, New Jersey, 1981.
[15] Trickey, H., Compiling PASCAL Programs into Silicon, Ph.D. thesis, STAN-CS-85-1059, Dept. of Computer Science, Stanford University, 1985.