Source level optimisation of VHDL for behavioural synthesis

T.P.K. Nijhar and A.D. Brown
Indexing terms: Behavioural synthesis, VHDL, Code optimisation
Abstract: Optimisation in high level behavioural synthesis is usually performed by applying transforms to the datapath and control graphs. An alternative approach, however, is to apply transforms at a higher level in the process, specifically directly to the behavioural source description. This technique is analogous to the way in which the source code of a conventional sequential programming language may be processed by an optimising compiler. The application of this kind of preprocessing to a number of example behavioural VHDL source descriptions (which are then fed into a ‘conventional’ synthesis system) produces structural descriptions which are up to 32% smaller and 52% faster.
1 Introduction
The high-level behavioural synthesis problem has been attacked from a number of directions. One of the central subproblems of the process is the question of design optimisation: for a given behavioural description (the input to the synthesis system), a large number of equivalent structural descriptions (the output from the synthesis system) exist. These equivalent descriptions will differ in area, processing speed and overall power dissipation. Goal oriented optimisation within the synthesis process guides the system to produce a solution which complies with the users' requirements, and is usually achieved through a series of transformations applied to the data and control path graphs. Aside from the datapath optimisation, it is possible to enhance the final design by performing modifications at the VHDL source level.

Although the optimisation of VHDL at this level can be regarded as similar in many ways to the optimisation of a conventional, sequential programming language, there are significant differences. The optimisation goals associated with a sequential language are (usually) execution speed and program size, the latter goal referring to the memory occupancy of the final executable program. (Further levels of sophistication exist; for example, the code and data area requirements may be minimised independently.) However, these optimisation techniques are focused on maximising the execution speed on, and the resource utilisation within, a fixed target architecture - the hardware on which the program is to run. (Thus it follows that most good optimisers are machine dependent.)

© IEE, 1996. IEE Proceedings online no. 19960631. Paper first received 29th January 1996 and in revised form 17th May 1996. The authors are with the Department of Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ, UK.

It is easy to lose sight of the fact that VHDL is a hardware description language: it cannot be executed in the conventional sense. The output from a segment of VHDL source is a structural description. While the goals of VHDL optimisation are broadly the same as those of a sequential programming language - execution speed and design area (power dissipation has no sequential programming analogue) - optimising VHDL gives the optimiser a significant extra degree of freedom: it is free to manipulate the 'executing' architecture itself. While many of the optimisation operations are clearly based on corresponding sequential programming techniques, this extra degree of freedom requires us to modify some of the transformations, and allows us to introduce new ones that have no direct sequential programming analogue. (It also renders some transformations less useful, and introduces interactions that do not exist in the sequential programming world.)

Optimisation of conventional sequential programming languages is a well researched area [1-6], and parallel programming language optimisation is also well documented [7-12]. However, the problem of optimising hardware description languages has received relatively little attention. [4] describes an optimiser that has been incorporated into a design automation system, taking as input a behavioural description in ISPS. This is translated into a group of directed acyclic graphs known as the value trace, and it is on this structure that the optimisations are performed. The transformations reported are divided into three categories: OPERATOR transforms (carried out on individual or groups of operators), SELECT transforms (carried out on the conditional constructs), and VTBODY transforms (carried out on groups of value trace operators). Examples of the transformations include constant folding, dead operator elimination and loop unwinding. It should be noted that the only loop optimisation performed is that of expanding the loop, when in general it is assumed that the transformations applied on a loop will be those that have the most effect. The results describe the effect of inline expansion, showing a decrease in the number of control steps of between 28% and 37%.
[13] describes an optimising compiler for VHDL. Only a restricted subset of VHDL is permitted, in order to be able to construct a reducible flow graph. Behavioural VHDL constructs are used to describe the design, using a single process statement (i.e. one single sequential block). The description is parsed and converted into an intermediate description, which is the substrate for the optimisations. This intermediate representation is in the form of a 'three address code', which is used to build up a graphical representation of the design called a process graph. Data flow analysis is performed in order to implement the transformations correctly, using an iterative method. Loop optimisations (where potentially the greatest changes may be effected) are not performed.

The Olympus system [14] incorporates a set of source level transformations into the front end of its synthesis system, including basic local and global transformations (such as common subexpression and dead code removal). The loop optimisations are limited in that there is the option of expanding fixed iteration loops; however, no attempt appears to be made to optimise the loop through transformations such as loop invariant or induction variable detection. The system also presents the option to inline expand subroutine calls (as does the system described here), and to bind operations to specific library functions. The system described in this paper does not share the shortcomings of the above; it can perform a larger range of optimisations on the entire synthesisable subset [15] of VHDL.
2 Goal oriented synthesis
The preprocessor described here is used as a front end to the MOODS (multiple objective optimisation by datapath synthesis) system [16-18]. A brief description is relevant here because MOODS is needed to quantify the effects of the preprocessing. (For the purposes of characterising the process, we are effectively using MOODS as a postprocessor to the source optimiser.) The quantitative metrics used by MOODS for a design are area, speed, power dissipation and 'testability'; these are attributes that may only be associated with a structural VHDL description. (The latter two attributes are not discussed in detail further here.) Both the input and output of the source optimiser are behavioural VHDL [Note 1], which cannot possess physical characteristics. The gross dataflow of the complete synthesis system is shown in Fig. 1; Fig. 2 shows a more detailed breakdown. MOODS performs the synthesis process by firstly constructing a control and datapath graph (a composite graph) of the behavioural description, with a 1:1 mapping of datapath operators to source constructs.
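As a concrete illustration of this 1:1 mapping, consider a description containing one multiplication and one addition: the initial composite graph then contains exactly one multiplier and one adder functional unit. (This is a minimal sketch with hypothetical entity and signal names; it is not one of the benchmarks discussed later.)

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity mac is
      port (a, b, c : in  unsigned(7 downto 0);
            y       : out unsigned(15 downto 0));
    end entity mac;

    architecture behaviour of mac is
    begin
      -- One '*' and one '+' in the source: the initial composite graph
      -- therefore holds one multiplier and one adder, before any
      -- annealing transforms are applied.
      y <= resize(a, 16) + (b * c);
    end architecture behaviour;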
Fig. 1 Gross dataflow of synthesis system

Fig. 2 Overall structure of MOODS synthesis system

Fig. 3 Movement of a synthesised design through design space (temperature decreases monotonically along the trajectory)

Fig. 4 Movement of a synthesised design through temperature/criteria space
Note 1: Nonbehavioural VHDL, such as generate statements and component calls, can pass through the source optimiser, although these are elaborated as opposed to being optimised.

Note 2: Small and local in this context does not necessarily imply datapath adjacency; multiplexing access to an expensive operator may create datapaths in the graph.
Under the guidance of a simulated annealing algorithm [19, 20], a series of small [Note 2], local, reversible transformations is performed on the composite graph, to bring the design closer to the user specified goals. The simulated annealing algorithm thus moves the design around in 'design space' (see Figs. 3 and 4). Note that each point on the trajectories of Figs. 3 and 4 corresponds to a different physical realisation of the required behaviour. Examples of the transformations used by MOODS are operator sharing (replacing two operators by a single operator and a multiplexer) and operator compression (replacing two simple combinatorial operators in a linear predicate relation in the datapath graph with a single, compound operator). Seventeen transforms in all are used by MOODS; a full list and description may be found elsewhere [18].
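Operator sharing can be pictured structurally as follows (a minimal sketch with hypothetical names; in MOODS the multiplexers and the controlling state sel are introduced on the composite graph, not written into the source):

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity share_demo is
      port (a, b, c, d : in  unsigned(7 downto 0);
            sel        : in  std_logic;  -- controller state: selects which sum
            s          : out unsigned(7 downto 0));
    end entity share_demo;

    architecture shared of share_demo is
      signal op1, op2 : unsigned(7 downto 0);
    begin
      -- Behaviourally two sums (a+b and c+d) are required; after sharing,
      -- a single adder and two operand multiplexers implement both.
      op1 <= a when sel = '0' else c;
      op2 <= b when sel = '0' else d;
      s   <= op1 + op2;
    end architecture shared;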
3 Source optimisation
The VHDL source is converted into a parse tree (consisting of a set of structures representing the various VHDL statements), and it is to this structure that the transforms are applied. In order to apply the optimisations, the data are viewed as a control flow graph (with nodes representing basic blocks and arcs indicating the data flow between them). Associated with the control flow graph is a set of data structures representing definition-use (DU) and use-definition (UD) chains. The purpose of these lists is to keep track of the variables and signals defined, together with when and where they are being read from or written to. The lists are constantly updated, as each transform performed has an effect on them. The definition-use chain has an entry for each assignment statement, together with links to all the uses reached by that definition. The use-definition chain may be associated with any type of statement, and it indicates all the possible definitions for each use in the statement.
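The fragment below (hypothetical, in the spirit of Fig. 5) marks in comments the entries such an analysis holds for the variable a; the conditional assignment is a 'threat' in the sense used below:

    -- Self-contained sketch; all names are illustrative only.
    entity duud_demo is end entity duud_demo;

    architecture demo of duud_demo is
      signal y, z, f, r : integer := 0;
      signal cond       : boolean := false;
    begin
      process (y, z, f, cond)
        variable a, b, q : integer := 0;
      begin
        a := y * z;    -- D1: definition of a; DU(D1) = {U1, U2, U3}
        b := b * a;    -- U1: use of a, reached only by D1
        q := 3 * a;    -- U2: use of a, reached only by D1
        if cond then
          a := f;      -- D2: conditional definition (a 'threat') of a
        end if;
        r <= a;        -- U3: UD(U3) = {D1, D2}; either definition may reach here
      end process;
    end architecture demo;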
Fig. 5 Definition-use/use-definition data structure: (a) a fragment of source code; (b) the corresponding DU/UD tables for a and b
Fig. 5 shows an example of these structures. Fig. 5a shows a fragment of VHDL code, and Fig. 5b the corresponding DU/UD structures. The DU table for the variable a contains entries corresponding to all the assignments (and threats) to a; each entry has pointers to subsequent uses of a (i.e. to statements whose outcome is predicated on the assignment/threat to a). Each entry in the UD table has pointers back to the appropriate DU entry.

3.1 The transforms

The transforms applied are listed in this Section. Each transform may be loosely considered to be of one of three types: sequential, parallel or hardware. Sequential transforms are those which are valid sequential transforms, but are of no (or negative) use in a descriptive environment. Parallel transforms are those which may be applied by parallel compilers, and are also applicable in a descriptive environment. Hardware transforms are those which optimise to a goal unique to a hardware description language (such as power dissipation or system testability). The optimiser currently uses a lexicon of 23 transforms (MOODS uses 17 for the composite graph optimisation). These are described in Table 1. As with a conventional sequential optimiser, the individual transforms may be grouped together to produce sets that are most appropriate for a particular goal (i.e. area or delay). By studying the effects of the transforms individually on a number of test data sets (we have a library of about 40 VHDL benchmarks [21-23]) we have arrived at the empirical groupings shown in the Table.

3.2 Transform groupings

As well as the goal groupings, there exist various relationships between the individual transforms (these help determine the order in which they are applied). The transforms interact in one of three ways:

No relationship: the order of application of the transforms is interchangeable. This covers the majority of cases.

Mutually exclusive: the outcome of one optimisation renders the other ineffective. The two transforms should never be applied together.

Dependent transforms: the dependent transform is more effective if applied after the first (independent) operation.

Transforms that have a considerable effect on the data structure (such as loop unwinding, component flattening and inline expansion) are applied first, so that the remaining transforms are applied to as large a data area as possible. The peephole class of transforms is applied at several stages during the optimisation process. These are a fast and inexpensive set of transforms to apply, and can be used to simplify the data structure after transforms such as loop unwinding have been applied. The transform interactions are shown in Fig. 6.

4 Examples
Results from two sizeable behavioural descriptions are presented here. Recall that the source optimiser is a behaviour-to-behaviour translator, and that MOODS is a behaviour-to-structure translator. Thus, to quantify the effects of the source level optimisation, it is necessary to run the output of the source optimiser through MOODS, in order to generate quantitative metrics for the quality of the design. However, although MOODS is an optimising compiler, it can be directed to translate behaviour to structure on an 'as is' basis. This allows us the closest possible quantitative look at the output from the source optimiser.
Table 1: Transform taxonomy. Flags are given in the order Area/Delay/Power/Test, and indicate the optimisation goals for which each transform is useful.

(1) yes/yes/no/no Constant folding (peephole): a transformation that replaces constant references with actual values and attempts to simplify the expression

(2) yes/yes/no/no Algebraic simplification (peephole and operator independent): the simplification of expressions that follow a particular pattern, such as x + 0, x * 1

(3) yes/yes/no/no Reduction in strength (peephole): replacement of specific operations with equivalent cheaper ones

(4) yes/no/no/no Common subexpression removal: identical subexpressions are detected and replaced with references to a new variable, which evaluates the common expression. This and transformation 22 (register minimisation through common expressions) have mutually opposed effects and should not be applied together

(5) yes/yes/no/no Dead code removal: removal of code that has no further use

(6) yes/yes/yes/no Copy propagation: replacing references to copies (statements of the form f := g) with the value of the copy statement. This transformation must be applied prior to common subexpression removal so that as many similar expressions as possible may be detected. It must also be followed by dead code removal, to delete the redundant copy expressions

(7) yes/yes/n-a/no Loop unwinding: expanding fixed iteration loops. The effect of this transform is dependent on the size and contents of the loop. It is effective in terms of area when the loop is small or the contents of the loop are highly dependent on the index, which allows various peephole transformations to be performed, simplifying a significant amount of the hardware generated

(8) yes/yes/yes/no Sequential loop fusion: merging two loops which have the same number of iterations. This optimisation is rendered ineffective if followed by loop unwinding

(9) yes/yes/yes/no Parallel loop fusion: an extension of sequential loop fusion, applied to two loops of differing sizes. The loops are fused together with the use of a conditional statement, so that they can exit when necessary. This optimisation is rendered ineffective if followed by loop unwinding

(10) yes/yes/yes/no Loop invariant: moving of constant expressions, within a loop, to the outside of the loop

(11) yes/yes/no/no Induction variable removal: locating two identifiers which remain in lockstep within a loop (i.e. every time variable 'a' is increased/decreased by 'x', 'b' is increased/decreased by 'y'). Once the induction variables have been detected, any multiplication operations can be reduced in strength by replacing them with addition/subtraction operations. An initialisation statement is placed in a preheader block

(12) no/yes/yes/no Loop unswitching: loops which contain only a conditional statement have the conditional moved outside the loop and the branches enclosed within loops

(13) yes/no/yes/no Conditional fusion: merging of conditional statements which are triggered by the same state

(14) no/yes/no/no Conditional parallelisation: placing each branch of a conditional in parallel. This is only effective when the conditional is a single branching if-then-else statement or when it is a multiple if-then-else construct with no default branch

(15) yes/yes/yes/no Subroutine inline expansion: replacing subroutine calls with the actual functionality of the unit being referenced. This is effective mainly for optimising with respect to delay, or when optimising for area and the subroutine in question is small or makes relatively few calls itself

(16) yes/yes/yes/no Merging subroutine calls: replacing subroutine calls which have identical input parameters with one call, and a series of assignment statements to the output parameters

(17) -/-/-/- Component flattening: evaluation of any component calls within an architecture. The transformation has no effect whatsoever with respect to any of the stated optimisation goals, but it does empower a wider range of subsequent transformations

(18) yes/yes/yes/no Merging component calls: the same type of transformation as merging subroutine calls, applied to component references

(19) no/no/yes/yes Grouping dependent statements: reordering statements so that dependent statements occur in sequence

(20) yes/no/yes/no Bit range bound testing: checking that bit vectors that have been declared make full use of the range specified, and if not, adjusting the range accordingly

(21) yes/yes/yes/yes Library dependent replacement: this optimisation is system dependent. It replaces the designer's own type declarations and function routines with those designed specifically for MOODS

(22) no/yes/yes/no Register minimisation through common expressions: replacing references to a variable with the expression represented by it. This and transformation 4 (common subexpression removal) have mutually opposed effects and should not be applied together. However, it will reduce register usage and decrease the delay of the design, with the possibility of area being reduced by subsequent dead code removal

(23) yes/no/no/no Pre-emptive processing: evaluating the possible targets of a conditional, in parallel, prior to calculating the conditional expression. The required result can then be assigned using copy statements
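To give the flavour of the peephole set (transforms 1-3, 5 and 6), the following hypothetical fragment shows a body before and after source level rewriting; the 'before' version is retained in comments. This is a sketch only, with illustrative names.

    entity peephole_demo is end entity peephole_demo;

    architecture after_opt of peephole_demo is
      signal x, y, r : integer := 0;
    begin
      -- Before (as written; STEP is the constant 4):
      --   v := (x + 0) * STEP;   -- x + 0 simplifies; STEP folds to 4
      --   w := v;                -- copy, propagated then deleted
      --   r <= w + (y * 1);      -- y * 1 simplifies to y
      -- After constant folding, algebraic simplification, copy
      -- propagation and dead code removal:
      process (x, y)
      begin
        r <= (x * 4) + y;  -- *4 may further reduce in strength to a shift
      end process;
    end architecture after_opt;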
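Similarly, a sketch of the loop transforms (7 and 10): a fixed iteration loop with an invariant subexpression, fully unwound after the invariant is hoisted. Again the fragment and its names are hypothetical.

    entity loop_demo is end entity loop_demo;

    architecture after_opt of loop_demo is
      type vec4 is array (0 to 3) of integer;
      signal d   : vec4 := (others => 0);
      signal k   : integer := 0;
      signal acc : integer := 0;
    begin
      -- Before:
      --   sum := 0;
      --   for i in 0 to 3 loop
      --     t   := k * 2;              -- loop invariant
      --     sum := sum + d(i) + t;
      --   end loop;
      -- After loop invariant motion (10) and loop unwinding (7); the four
      -- copies of the body are now open to the peephole set:
      process (d, k)
        variable t, sum : integer;
      begin
        t   := k * 2;                   -- hoisted out of the loop
        sum := 0;
        sum := sum + d(0) + t;
        sum := sum + d(1) + t;
        sum := sum + d(2) + t;
        sum := sum + d(3) + t;
        acc <= sum;
      end process;
    end architecture after_opt;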
Fig. 6 Transform interactions (A-B: A and B are mutually exclusive, their effects counteract each other; A→B: A must be applied before B to achieve maximum benefit)
The possible routes for a VHDL description through the overall synthesis suite are shown in Table 2.

Table 2: Optimisation path labels

Source optimisation      Datapath optimisation    Path label
with respect to:         with respect to:
area                     NOTHING                  AN
area                     area                     AA
area                     delay                    AD
delay                    NOTHING                  DN
delay                    area                     DA
delay                    delay                    DD
DISABLED                 NOTHING                  NN
DISABLED                 area                     NA
DISABLED                 delay                    ND
4.1 The Kalman filter

The Kalman filter is regarded as a primitive computational building block in many modern control theory applications. The goal of the Kalman filter is to predict a system state from a set of measurements of smaller dimensionality than the state vector [23]. The initial behavioural specification takes 111 lines of VHDL, and generates 32 functional units. The effects of the various optimisation paths for the filter are shown in Fig. 7. (The AD and DA paths should be rather pointless - they imply contradictory goals for the two stages of optimisation - but they are included for the sake of completeness.)

Fig. 7 Design space trajectories for optimisation of Kalman filter. Path labels are described in Table 2; design data are tabulated in Table 3

Table 3: Quantitative data for Fig. 7

Path label   Functional units   Execution time, s   Register count   Control states
NN           32                 -                   92               43
NA           22                 3.05                10               19
ND           24                 3.05                11               22
DN           79                 -                   73               202
DA           9                  121.43              8                71
DD           10                 65.15               7                68
AN           24                 -                   40               85
AA           18                 2.23                9                14
AD           19                 2.23                10               18
Fig. 8 Design space trajectories for optimisation of differential heat release algorithm. Path labels are described in Table 2; design data are tabulated in Table 4
The results show that optimisation at the source level is clearly a worthwhile step in the overall optimisation process. The design optimised at the source level alone shows significant improvements over the NN optimised design, and the two optimisers (source level and composite graph) operating in cascade show greater benefits still.
What is unexpected from Fig. 7 is that optimisation with respect to area at source level followed by optimisation with respect to delay at composite graph level produces a design which has less delay than any other combination. This is completely counterintuitive, but the Kalman filter is not the only design to exhibit this effect. A detailed tabulation of the metrics of Fig. 7 (functional units, execution time, design register count and controller size) is given in Table 3. This notwithstanding, the execution speed of the system is sufficiently fast that all eight combinations of optimisation may readily be tried for each design. (The runtimes shown in Fig. 7 are taken from a 50MHz 486 PC running Linux.)

4.2 The differential heat release algorithm

The differential heat release algorithm models the heat release within an internal combustion engine. It consists of about 100 lines of code, in the form of a single process. Control features and data structures include a number of while loops and a bit vector array. The impact of the source preprocessor on the design is clear from Fig. 8; the results show a decrease of between 35 and 42% in area and 69-85% in delay (as well as a significant decrease in datapath optimisation runtime). Quantitative details are presented in Table 4.

Table 4: Quantitative data for Fig. 8

Path label   Functional units   Execution time, s   Register count   Control states
NN           428                -                   238              1378
NA           424                45.65               222              1361
ND           415                36.18               220              1329
DN           27                 -                   38               99
DA           13                 3.07                7                19
DD           14                 2.96                6                20
AN           26                 -                   36               98
AA           13                 2.89                6                18
AD           14                 2.92                6                22
5 Final remarks
Figs. 7 and 8 show the interactions of the two optimisation techniques for different user goals. The symbiosis of the two techniques is evident from the juxtaposition of the resultant designs in the two-dimensional design space shown. In all cases, sensible application of the optimisations (i.e. the AA and DD designs) produces significantly superior designs. The optimisation of synthesised digital designs is a crucial step in the overall process, as it can reduce overall area and delay by factors of up to an order of magnitude. Optimisation of the composite graph has been shown to produce very efficient optimisations,
and the inclusion of the technique described here produces further gains, at relatively little computational cost. Further work is being undertaken on more sophisticated source level transforms that capitalise on aspects of the source description to hardware mapping; these will be reported in a later paper.

6 Acknowledgment
The work described in this paper was funded by the Engineering and Physical Sciences Research Council.

References

1 AHO, A.V., SETHI, R., and ULLMAN, J.D.: 'Compilers: principles, techniques and tools' (Addison Wesley, 1986)
2 WULF, W.A., JOHNSSON, R.K., WEINSTOCK, C.B., HOBBS, S.O., and GESCHKE, C.M.: 'The design of an optimising compiler' (American Elsevier, New York, 1975)
3 BOYLE, D., MUNDY, P., and SPENCE, T.M.: 'Optimisation and code for several machines', IBM J., 1980, 24, (6), pp. 677-682
4 WALKER, R.A., and THOMAS, D.E.: 'Behavioural level transformation in the CMU-DA system'. Proceedings of the 20th IEEE Design Automation Conference, 1983, pp. 788-789
5 TANENBAUM, A.S., VAN STAVEREN, H., and STEVENSON, J.W.: 'Using peephole optimisation on intermediate code', ACM Trans. Program. Lang. Syst., 1982, 4, (1), pp. 21-36
6 LOWRY, E.S., and MEDLOCK, C.W.: 'Object code optimisation', Commun. ACM, 1969, 12, pp. 159-166
7 PADUA, D.A.: 'Issues in the compile time optimisation of parallel programs'. Proceedings of the International Conference on Parallel Processing, II, 1990
8 FERRANTE, J., and MACE, M.: 'On linearizing parallel code', ACM, 1984, pp. 179-189
9 FERRANTE, J., MACE, M., and SIMONS, B.: 'Generating sequential code from parallel code', ACM, 1988, pp. 582-593
10 GRUNWALD, D., and SRINIVASAN, S.: 'Data flow equations for explicitly parallel programs'. 4th ACM PPoPP, 1993, pp. 159-168
11 MASTICOLA, S.P., and RYDER, B.G.: 'Non-concurrency analysis'. 4th ACM PPoPP, 1993, pp. 129-137
12 SHASHA, D., and SNIR, M.: 'Efficient and correct execution of parallel programs that share memory', ACM Trans. Program. Lang. Syst., 1988, 10, (2), pp. 282-312
13 BHASKER, J.: 'Implementation of an optimising compiler for VHDL', SIGPLAN Not., 1988, 23, pp. 92-103
14 DEMICHELI, G., KU, D., MAILLOT, F., and TRUONG, T.: 'The Olympus synthesis system', IEEE Des. Test Comput., 1990, 7, pp. 37-53
15 RUSHTON, A.: 'VHDL for logic synthesis' (McGraw-Hill, 1995)
16 BAKER, K.R., CURRIE, A.J., and NICHOLS, K.G.: 'Multiple objective optimisation in a behavioural synthesis system', IEE Proc. G, 1993, 140, (4), pp. 253-260
17 BAKER, K.R., BROWN, A.D., and CURRIE, A.J.: 'Optimisation efficiency in behavioural synthesis', IEE Proc. Circuits Devices Syst., 1994, 141, (5), pp. 399-406
18 BAKER, K.R.: 'Multiple objective optimisation of data and control paths in a behavioural silicon compiler'. PhD thesis, University of Southampton, 1992
19 KIRKPATRICK, S., GELATT, C.D., and VECCHI, M.P.: 'Optimisation by simulated annealing', Science, 1983, 220, (4598), pp. 671-680
20 KIRKPATRICK, S.: 'Optimisation by simulated annealing: quantitative studies', J. Stat. Phys., 1984, 34, (5/6), pp. 975-986
21 VEMURI, R., ROY, J., MAMTORA, P., et al.: 'Benchmarks for high level synthesis'. Electrical and Computer Engineering Department, University of Cincinnati, June 1991
22 PANDA, P.R., and DUTT, N.: '1995 high level synthesis design repository'. University of California, Irvine, February 1995
23 MAYO, R.: 'High level synthesis benchmarks', 1991