Logic synthesis techniques for embedded control code optimization

Jordi Cortadella, Universitat Politècnica de Catalunya, Barcelona, Spain
Luciano Lavagno, Ellen Sentovich, Cadence Berkeley Laboratories, Berkeley, USA

This work has been funded by CICYT TIC95-0419 and the MURST project "VLSI Architectures".
Abstract

We propose software synthesis and optimization based on the composition of three techniques that have recently been developed or undergone evolutionary changes. They all have a basis in logic synthesis, but each contains quite different analyses and transformations. They are applied to software for embedded systems. They are generally much more expensive than the optimizations normally used in general-purpose compilers, but are based on a much more precise understanding of the meaning of the program or procedure being optimized.

Standard compilers are limited in the type and amount of optimizations they can perform by:
- execution time limits: since they must be applicable to large programs within a short compile-and-debug cycle, they cannot afford optimizations whose complexity is more than quadratic (or even linear) in the size of the source code;
- the richness of the language: since determining the function computed by a program is impossible in general, compilers must limit themselves to a small set of transformations that can be shown to preserve the original semantics.

Embedded systems are quite particular with respect to both of these aspects. The (often) small code size and the (hopefully) long lifetime and tight cost constraints make even very complex optimizations feasible, profitable, and necessary. Moreover, a finite-state programming paradigm is often required (e.g., due to the lack of virtual memory), thus making it possible to "understand" the function that must be computed and to apply truly global optimization techniques.

In this paper, we survey and compare three steps that have been used to implement programs specified using a family of languages called Synchronous Languages [10]. These languages have been specifically designed for embedded code, and have a semantics based on Extended Finite State Machines (EFSMs). EFSMs are similar to FSMs, with explicit connections to a data path (through signals and variables) that is used for the data-intensive computations. In EFSM synthesis, the emphasis is on control.

The first section summarizes a technique to analyze a synchronous program for correctness and to derive a fully abstract EFSM model from it. It is based essentially on a circuit representation of the program itself. The second section describes an efficient software implementation of an EFSM, based on Binary Decision Diagram (BDD) minimization. The third section shows how Petri net synthesis algorithms, originally developed for asynchronous circuit synthesis, can be used to derive a single-appearance schedule for a reactive control program (previously, such schedules were only known for programs synthesized from data flow networks [6]). Finally, we discuss the complexity aspects of each algorithm, and describe how they can be put together into a single, efficient optimizing compiler for EFSMs.
1 From Synchronous Program to EFSM

There are a number of synchronous languages and compilers interpreting them; only a few of these have a very precise and well-defined semantics. One such language is ESTEREL [4]. ESTEREL is used to specify complex control algorithms, and the ESTEREL language and compiler rely on the synchronous hypothesis: the reaction of the program takes zero time, or negligible time with respect to the environment in which it runs. This hypothesis must be checked on the final implementation.
The hypothesis greatly simplifies synthesis and verification, since the interaction of a set of modules depends only on their relative functional behaviors and not on their timing. A set of ESTEREL modules communicate with each other through broadcast signals. At each reaction, or clock tick, each module reads the set of signals and reacts to this set, possibly by emitting new signals, all in "zero time". It begins its control flow at the point where it stopped in the last reaction (an await-type statement) and continues, passing control from statement to statement until it hits another await-type statement. Signals that are read and written during a reaction are read and written "at the same time instant"; time only elapses between reactions. This implies a fixed point on the set of emitted signals at each reaction. The ESTEREL semantics give precise rules for computing this fixed point (and for determining whether or not such a fixed point exists). If a fixed point exists, it can be shown that the ESTEREL program has a precisely equivalent implementing finite-state machine, and hence can be translated directly to a Boolean circuit computing the fixed point at each reaction. This observation led to a drastic change in the ESTEREL compiler [5]: Boolean circuits and the results of logic synthesis are now used to analyze ESTEREL programs and optimize ESTEREL-produced code. Currently, the ESTEREL compiler translates a set of ESTEREL modules into a Boolean circuit before producing C code, which can be interfaced with more data-path-like structures to complete a design and generate a C simulation program.

The ability to broadcast signals and transfer control means that one can write syntactically correct programs that do not have a meaningful implementation: programs that are either non-deterministic (more than one solution exists) or non-reactive (no solution exists). For example, a simple program emitting signal X only if X is present has two solutions: X present and X absent. A simple program emitting X only if X is not present has no solutions. This problem is known as the causality problem, and its proper solution requires a three-valued simulation over the reachable state space of the underlying state machine [11, 16]. The constructive causality semantics, described by their behavioral, operational, and electrical equivalents, can be found in [3]. One result contained therein is that a constructively causal program at the behavioral level is precisely equivalent to an electrically deterministic circuit, regardless of delays. That is, if the Boolean circuit has a unique and well-defined steady state, the ESTEREL program from which it was derived is causal. A constructively causal circuit is both deterministic and reactive (a reactive, deterministic program is referred to as logically correct; a logically correct circuit is not necessarily causal, see [3] for more details). One set of circuits that is electrically causal is the set of acyclic circuits: given enough time to react, an acyclic circuit will compute a single set of values on its outputs. Thus all non-causal circuits are cyclic, but not all cyclic circuits are non-causal. Furthermore, if a circuit is causal, it has an acyclic equivalent. The state-of-the-art method for handling the causality problem is to generate a Boolean circuit equivalent to the behavioral specification, break all cycles in the circuit, and perform causality analysis using three-valued simulation. If the circuit is causal, an equivalent acyclic version is produced and further synthesized and optimized using standard tools (current synthesis tools handle only acyclic circuits).
Finally, C code (or hardware) implementing the original program is produced. This approach is thus capable of producing accurate implementations (in terms of a finite-state machine description, later translated to software or hardware) for all causal synchronous language specifications.

As an example of this process, consider the following ESTEREL module, which is similar to an SR latch:

    module SR:
      input S, R;
      output Q, QB;
      present S and not QB then emit Q end
      ||
      present R and not Q then emit QB end
    end module

The inputs are S and R, and the internal state variables, which double as output variables, are Q and QB. The circuit implementing this ESTEREL program (the actual ESTEREL implementation is somewhat larger, including, for example, module initialization and ending signals) consists of two gates: g1 produces Q from S and the fed-back QB, and g2 produces QB from R and the fed-back Q (circuit figure omitted).
There is clearly a cycle from S through g1 to Q, through g2 to QB, and back to the input of g1. Causality analysis would break an arc on the cycle, for example from QB to the input of g1, assign the unknown value U to this input of g1, and simulate the circuit for all possible input vectors (in practice, of course, this is done implicitly using symbolic simulation). If the outputs are well defined for each input vector, the circuit is causal. In this case, if S is 0, Q is 0 and QB is equal to R (which is a unique value). A similar analysis holds when R is 0. If both S and R are 1, both Q and QB remain undefined. In this case, one cannot determine a priori the output of the circuit, and thus it is non-causal.

Now suppose the 1,1 input for S,R is disallowed. This can be done in the ESTEREL program by adding the statement

    relation S # R;

which indicates that signals S and R are never emitted simultaneously. An equivalent circuit might derive internal signals S' and R' from S and R and feed them to g1 and g2 (circuit figure omitted). Here, while the circuit portion from S', R' to Q, QB is not causal, the portion from S, R to Q, QB is causal: the value 1,1 can never be produced for S', R', and so the cycle is never active during operation. It can be easily verified that no matter where one breaks the cycle, iterative symbolic simulation as proposed in [11] will result in well-defined output values. The equivalent acyclic circuit has inputs S and R and outputs Q and QB (figure omitted).
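To make the three-valued analysis concrete, the following is a minimal sketch, in the spirit of [11, 16], of fixed-point evaluation over the domain {0, 1, U} applied to the SR example above: all signals start at the unknown value U, the two gate equations are iterated to a fixed point, and an input vector is accepted only if every output settles to 0 or 1. The encoding, the function names and the fixed iteration bound are our own illustrative choices, not the ESTEREL compiler's implementation.

    /* Three-valued (0, 1, U) causality check for the SR example:
       Q = S and not QB,  QB = R and not Q. */
    #include <stdio.h>

    typedef enum { V0, V1, VU } val;               /* three-valued domain */

    static val t_not(val a) { return a == VU ? VU : (a == V0 ? V1 : V0); }

    static val t_and(val a, val b) {               /* 0 is a controlling value */
        if (a == V0 || b == V0) return V0;
        if (a == V1 && b == V1) return V1;
        return VU;
    }

    /* Returns 1 if both outputs settle to a defined value for inputs (S, R). */
    static int constructive(val S, val R) {
        val Q = VU, QB = VU;                       /* start from "unknown"     */
        for (int i = 0; i < 4; i++) {              /* enough rounds for 2 gates */
            val nQ  = t_and(S, t_not(QB));
            val nQB = t_and(R, t_not(Q));
            Q = nQ; QB = nQB;
        }
        return Q != VU && QB != VU;
    }

    int main(void) {
        const val v[2] = { V0, V1 };
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                printf("S=%d R=%d -> %s\n", i, j,
                       constructive(v[i], v[j]) ? "defined" : "undefined");
        return 0;
    }

Running this sketch reports "defined" for every input vector except S = R = 1, matching the analysis above.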
In [11], several examples of cyclic yet causal circuits were given. It is clear from those examples, and from a number of others in the literature, that there are cases in which the most efficient implementation (in terms of hardware area or software C-code size) is in fact derived from a cyclic representation. We treat the subject of re-deriving optimized cyclic representations from an EFSM in section 3.
2 From EFSM to Acyclic Code

This section only sketches the main results; the interested reader is referred to [7]. The compilation from EFSMs to code uses a specialized form of control/data flow graph, called an s-graph. The s-graph is defined so that
it can implement the transition function of an EFSM, and it has a direct mapping from a BDD representation of that function, to exploit BDD optimization algorithms.
An s-graph is a directed, possibly cyclic graph with the following nodes:

- a source node, with a single child;
- a sink node, with no children;
- ASSIGN nodes, which evaluate a function of the EFSM input and output variables z_1, ..., z_n (state variables are considered both inputs and outputs for the purpose of this discussion), assign it to an EFSM variable z_v, and pass control to their single child;
- TEST nodes, which evaluate a function of the EFSM variables z_1, ..., z_n and pass control to one of their children depending on the result of the function (each node has as many children as output values of the function).
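As an illustration (our own, not the compiler's actual data structures), the node types above can be encoded in C as a tagged structure, and an s-graph can be executed by a trivial interpreter that walks it from source to sink:

    /* Illustrative encoding of s-graph nodes; expressions over the EFSM
       variables are represented here as function pointers. */
    enum node_kind { SOURCE, SINK, ASSIGN, TEST };

    struct sgraph_node {
        enum node_kind kind;
        int (*expr)(const int *vars);   /* function of the EFSM variables z1..zn */
        int target;                     /* index of the variable written (ASSIGN) */
        struct sgraph_node **child;     /* SOURCE/ASSIGN: child[0];               */
                                        /* TEST: one child per value of expr      */
    };

    /* Execute one EFSM transition by walking the graph from source to sink. */
    static void run_transition(const struct sgraph_node *source, int *vars) {
        const struct sgraph_node *n = source;
        while (n->kind != SINK) {
            switch (n->kind) {
            case SOURCE: n = n->child[0];                     break;
            case ASSIGN: vars[n->target] = n->expr(vars);
                         n = n->child[0];                     break;
            case TEST:   n = n->child[n->expr(vars)];         break;
            case SINK:   /* excluded by the loop condition */ break;
            }
        }
    }

In the compiler the graph is not interpreted at run time: each node is emitted as straight-line C code (tests become if statements, assignments become assignments), as discussed next.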
This simple representation has a straightforward translation into C and can be translated with equal ease into object code by any available compiler. The execution time of the code thus roughly depends on the length of a path from source to sink (we require that there are no data-dependent loops inside an s-graph, in order to guarantee a deterministic delay bound), and its size roughly depends on the number of edges in the s-graph.

The algorithm for deriving a minimal s-graph from the transition function of an EFSM depends essentially on the relation between a BDD representing the characteristic function of a multi-output Boolean function and an s-graph computing that function. Let us consider an arbitrary ordering of the EFSM variables, and let us assume, for the sake of simplicity, that all the variables are binary. Let F_{z_i=1} and F_{z_i=0} denote the cofactors of F with respect to z_i = 1 and z_i = 0, and let S_{z_i} F denote F_{z_i=1} ∨ F_{z_i=0}. If f(z_1, ..., z_n) is the characteristic function of the transition function of an EFSM, then the corresponding s-graph is computed as build(0, f):

    procedure build(i: index; F: function)
    begin
      if i = 0 then
        create a BEGIN vertex v
        next(v) := build(1, F)
      else if F = 1 then
        create the END vertex v
      else if z_i is an input then
        create a TEST vertex v labeled with z_i
        let the "true" child be build(i+1, F_{z_i=1})
        let the "false" child be build(i+1, F_{z_i=0})
      else if z_i is an output then
        create an ASSIGN vertex v labeled with z_i and S_{z_j : i < j ≤ n, z_j is an output} F_{z_i=1}
        let its child be build(i+1, S_{z_i} F)
      return v
    end

The s-graph can be minimized (on the fly) by merging isomorphic nodes, exactly as in BDD canonization procedures.

Theorem 2.1 [7] Let f(x_1, ..., x_m, y_1, ..., y_l) be the characteristic function of a multi-output function with components y_k = f_k(x_1, ..., x_m), and let z_1, ..., z_{l+m} be an arbitrary total ordering of its variables. Then the s-graph G returned by procedure build(0, f) computes that multi-output function.

It is easy to see that if (1) the same variable ordering is used for all branches of the recursion, and (2) outputs are ordered after the inputs on which they depend, then the size of the s-graph is the same as that of a BDD using the same unique ordering. In particular, BDD nodes labeled with an output variable have only one child with a path to the "1" leaf (the child corresponds to the value assigned to the variable). This means that well-known BDD reordering techniques can be used to determine an order yielding a minimal s-graph.

We are currently exploring free-BDD minimization techniques that do not require condition (1) above. We have also tried to relax condition (2). This implies that an output variable should be assigned an expression depending on inputs that have not yet been encountered in the recursion. While the idea seems to promise a smooth trade-off between ASSIGN and TEST vertices, it has never paid off in a large set of experiments using real, industrial designs.

Any s-graph construction algorithm based on BDDs or free BDDs, however, is necessarily limited to yielding an acyclic s-graph. Cyclic s-graphs can sometimes be much smaller than acyclic ones, when the assigned and tested expressions are complex (in the discussion above they were just assumed to be Boolean values, while in real EFSMs they can be much more complex). The next section discusses a mechanism to introduce cycles by using sequential (asynchronous) logic synthesis techniques.
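As a concrete (hypothetical) illustration of the target, consider a toy EFSM, not taken from the paper's examples, with one input a, one state bit s, output y = a XOR s and next state s' = a. With variable order a, s, the acyclic s-graph compiles to C code of the following shape, in which the nesting of TEST nodes mirrors the BDD order and each ASSIGN appears once per path:

    /* Hypothetical reaction function generated for the toy EFSM above. */
    #include <stdio.h>

    static int s = 0;                  /* state variable                 */

    static void react(int a, int *y) {
        if (a) {                       /* TEST on input a                */
            if (s) { *y = 0; }         /* TEST on state s, then ASSIGN y */
            else   { *y = 1; }
            s = 1;                     /* ASSIGN next state s' = a       */
        } else {
            if (s) { *y = 1; }
            else   { *y = 0; }
            s = 0;
        }
    }

    int main(void) {
        int y;
        const int inputs[4] = { 0, 1, 1, 0 };
        for (int i = 0; i < 4; i++) {
            react(inputs[i], &y);
            printf("a=%d -> y=%d s=%d\n", inputs[i], y, s);
        }
        return 0;
    }

On realistic designs, merging isomorphic subgraphs during construction shares identical branches, which is why the BDD variable-ordering heuristics directly control the size of the generated code.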
3 From Acyclic Code to Minimized Cyclic Code

This section describes how to derive, from an arbitrary s-graph, a minimized s-graph that implements the same function. In particular, we show that it is possible to derive an s-graph in which each action (TEST or ASSIGN, with a given associated expression and variable) of the original s-graph appears only once. Of course, this minimization may add some new variables, TESTs and ASSIGNs in order to preserve the original functionality. It is advantageous if the original actions were fairly complex and the overhead associated with function calls is unacceptable for the application (this is often the case for embedded applications in which RAM for implementing a stack is a scarce resource and the CPU does not have efficient stack management instructions, e.g., in DSPs). Here we only summarize the basic results of the theory of regions, on which the optimization algorithm is based, and refer the interested reader to [13, 8].
3.1 S-graphs and Petri Nets

A Petri net (PN) [14, 12] is a 4-tuple (P, T, F, m_0) in which P is a set of places, T is a set of transitions, F ⊆ (T × P) ∪ (P × T) is the flow relation ((P, T, F) is a bipartite directed graph), and m_0 is the initial marking. A marking is a multi-set of places; the cardinality of a place in a given marking is also called the number of tokens assigned to the place in that marking. A PN transition may fire when all its predecessor places have at least one token. When it fires, it decrements the markings of all its predecessors and increments the markings of all its successors. A PN is said to be safe if no place can have a marking greater than one in any of its reachable markings (here we consider only safe PNs). A State Machine is a PN such that every transition has at most one predecessor and one successor.

An s-graph can be interpreted as a PN in which each node is a place and each edge is a transition. An ASSIGN node becomes a simple place with a single-predecessor transition and a single-successor transition. A TEST node becomes a place with several successor transitions, one for each possible value of the tested function, each with one successor place. The tested expression and the corresponding value label each transition. It should be obvious from this definition that a PN derived from an s-graph is a State Machine. It should also be obvious that transitions with the same label can occur in several places of the PN (see Figure 1). The objective of this optimization procedure is to minimize the number of occurrences of each label.

The problem of synthesizing place-irredundant Petri nets was already solved in [8]. The algorithms given there minimize the number of transitions in the PN, and hence they can be used to minimize the code size of the s-graph, since transitions in the PN correspond to edges in the s-graph. It is also possible, by adding so-called silent transitions (transitions without a label), to synthesize a PN in which each transition label appears only once. This can be used to obtain a single-appearance schedule for the labeled actions of the s-graph. Note that the algorithm of [8] does not guarantee that the minimized PN still has an efficient implementation as an s-graph. One mechanism for ensuring the existence of this backward mapping from PNs to s-graphs is via the notion of State Machine coverability.
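The token game just described is easy to state in C. The following sketch (our own illustration; the dense pre/post matrices and the size bounds are arbitrary choices) represents a safe PN with a 0/1 marking vector:

    /* Firing rule for a safe Petri net.  pre[t][p] = 1 when place p is a
       predecessor of transition t, post[t][p] = 1 when it is a successor;
       the marking is a 0/1 vector indexed by place. */
    #define MAX_P 16                    /* illustrative bounds */
    #define MAX_T 16

    struct petri_net {
        int n_places, n_trans;
        int pre [MAX_T][MAX_P];
        int post[MAX_T][MAX_P];
    };

    /* A transition is enabled when all its predecessor places are marked. */
    static int enabled(const struct petri_net *net, const int *marking, int t) {
        for (int p = 0; p < net->n_places; p++)
            if (net->pre[t][p] && !marking[p]) return 0;
        return 1;
    }

    /* Firing consumes the tokens of the predecessors and produces tokens in
       the successors (booleans suffice because the net is safe). */
    static void fire(const struct petri_net *net, int *marking, int t) {
        for (int p = 0; p < net->n_places; p++) {
            if (net->pre [t][p]) marking[p] = 0;
            if (net->post[t][p]) marking[p] = 1;
        }
    }

For a general (non-safe) net the marking entries would be counters that are decremented and incremented instead of cleared and set.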
Definition 3.1 (SM Component [9]) A State Machine Component N_1 of a Petri net N is defined as a connected subnet of N with the following properties:

1. Each transition of N_1 has exactly one input and one output edge.
2. All input and output transitions of a place in N_1 (and their connecting arcs) also belong to N_1.

Property 3.1 Given a minimal saturated Petri net N obtained as described in [2], any set of disjoint minimal places such that at least one of them is marked in any marking reachable from m_0 defines an SM-component of N.

A PN is SM-coverable if every place of the PN belongs to an SM-component; it is easy to show that any PN can be made SM-coverable by just adding places that do not change its firing sequences (redundant places).
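Definition 3.1 can be checked mechanically. Reusing the petri_net structure from the previous sketch, the following function (again our own illustration) verifies the one-input/one-output condition for the subnet induced by a set of places; connectedness of the subnet is not checked here:

    /* in_set[p] = 1 if place p belongs to the candidate component.  Every
       transition touching the set must have exactly one predecessor and one
       successor place inside it. */
    static int is_sm_component(const struct petri_net *net, const int *in_set) {
        for (int t = 0; t < net->n_trans; t++) {
            int npre = 0, npost = 0;
            for (int p = 0; p < net->n_places; p++) {
                if (!in_set[p]) continue;
                if (net->pre [t][p]) npre++;
                if (net->post[t][p]) npost++;
            }
            if ((npre || npost) && (npre != 1 || npost != 1)) return 0;
        }
        return 1;
    }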
3.2 S-graph optimization algorithm

An SM cover of a PN can be used as a basis for s-graph (and hence code) generation as follows:

1. One SM-component (typically the largest one) is implemented by using the s-graph traversal semantics from source to sink (i.e., it is implemented by the program counter of the CPU on which the synthesized code is executed).
2. The places not covered by this component are implemented by variables that are incremented, decremented and tested so as to traverse the synthesized s-graph in a fashion that exactly mimics the PN firing rule, and hence preserves the same sequence of ASSIGNs and TESTs as the original, un-minimized s-graph.

It therefore becomes extremely important to minimize the number of these variables, as well as the number of their assignments and tests, because they are additional costs that are not directly taken into account by the original PN minimization procedure.

The algorithm works as follows:

1. Generate all minimal places of the PN and a place-irredundant safe Petri net N as proposed in [8].
2. For each place p of N, find a maximal subset SM of disjoint places of N that do not intersect with p. If SM does not completely cover all reachable markings, add some disjoint redundant minimal places to N until all reachable markings are covered (i.e., an SM-component is formed).
The result of the previous synthesis algorithm will be an SM-coverable Petri net in which each SM-component will only have one token, i.e., only one place of each SM-component will be marked in every reachable marking. In that case, each reachable marking can be represented as a vector m = (m_0, ..., m_{k-1}), where m_i is the place of SM_i that is marked at marking m. For each SM-component SM_i = {p_1, ..., p_l}, an encoding function C_i : SM_i → N (the natural numbers) can be defined in such a way that each marking m can now be represented as a vector of codes m = (C_0(m_0), ..., C_{k-1}(m_{k-1})). We will call this vector an SM-vector.

The information supplied by each SM-component of a Petri net can be redundant with regard to the information provided by the other SM-components. This becomes obvious in those reachable markings in which two or more SM-components share some marked place. In the framework of code generation to simulate the execution of a Petri net, a compact encoding of the SM-components contributes to reducing the number of operations needed to update the SM-vector that represents the current marking. The encoding strategy previously proposed can be further optimized by taking into account the contribution of each SM-component to the distinguishability of different markings. The following procedure is proposed to calculate a compact encoding for the places of SM_i, assuming that the SM-components SM_0, ..., SM_{i-1} have already been encoded:
1. Build a compatibility graph in which each vertex represents a place of SM_i. There is an edge between two places p_1 and p_2 if, for all pairs (m_1, m_2) of reachable markings such that p_1 is marked in m_1 and p_2 is marked in m_2, their sub-markings with respect to SM_0, ..., SM_{i-1} are different.
2. Find a clique partitioning of the compatibility graph. Each clique represents a subset of places that can have the same code.

The overall complexity of the present approach is dominated by the synthesis of PNs from s-graphs. In the worst case, this technique is exponential in the size of the s-graph, but it exhibits linear complexity for most practical cases [8].
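Step 2 does not prescribe a particular algorithm; a simple greedy heuristic (one possible choice, sketched below with an adjacency-matrix input) assigns each place to the first existing clique it is compatible with, or opens a new one. The clique number then serves directly as the code of the place within SM_i:

    /* Greedy clique partitioning of the compatibility graph.
       compat[i][j] = 1 when places i and j may share a code;
       code[i] receives the clique (i.e., the code) assigned to place i.
       Returns the number of distinct codes used. */
    #define MAX_PLACES 64                          /* illustrative bound */

    static int clique_partition(int n, int compat[][MAX_PLACES], int code[]) {
        int n_cliques = 0;
        for (int v = 0; v < n; v++) {
            int assigned = 0;
            for (int c = 0; c < n_cliques && !assigned; c++) {
                int ok = 1;                        /* compatible with every member? */
                for (int u = 0; u < v; u++)
                    if (code[u] == c && !compat[v][u]) { ok = 0; break; }
                if (ok) { code[v] = c; assigned = 1; }
            }
            if (!assigned) code[v] = n_cliques++;  /* open a new clique */
        }
        return n_cliques;
    }

Fewer codes mean fewer distinct values to test and assign when updating the SM-vector, which is exactly the overhead the encoding step tries to reduce.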
[Figure 1, panels (a) and (b), shows the s-graph and the corresponding Petri net for the example of Section 3.3: places p0 through p7 between BEGIN and END, tests on the inputs a and b (arcs labeled a=0, a=1, b=0, b=1), and operations op1, op2, op3. The drawings are omitted here; the optimized code of panel (c) is reproduced below.]

    begin: sm1 = sm2 = 0;
    p0:    if (!a) { sm2 = 1; goto p3; }
    p4:    op2; sm1 = 1;
           if (b) { sm2 = 0; goto p7; }
           sm2 = 1;
    p3:    op1;
    p7:    if (sm1) op3;
           if (!sm2) op2;
    end:
Figure 1: (a) S-graph, (b) Petri net, (c) Optimized code.
3.3 Example

Figure 1 depicts a complete synthesis example. Let us assume that op1, op2 and op3 are complex operations in the s-graph. By synthesizing a safe PN from the s-graph, only one transition per operation is achieved. The state machines covering the Petri net are defined by the following sets of places: SM_0 = {p0, p3, p4, p7}, SM_1 = {p0, p2, p6} and SM_2 = {p0, p1, p3, p5}. SM_0 is implemented by the program counter, whereas the other two state machines are implemented by integer variables (sm1 and sm2). In this case, the following encoding has been used for the places: in SM_1, places p0 and p2 are represented by the value 0 and p6 by 1; in SM_2, places p0 and p1 are represented by 0 and p3 and p5 by 1. Note that in the final code only one instance of op1 and of op3 is generated, at the expense of adding some overhead in the form of simple assignments and conditional branches, presumably simpler than the saved code for the complex operations.
4 Summary and conclusions

In this paper we have reviewed three optimization techniques that allow us to compile a synchronous language specification into highly optimized code. The final result is that, no matter how convoluted the initial specification was, the synthesized code:

- bears no trace of the original language constructs, thanks to the fully abstract EFSM semantics;
- is guaranteed to execute a transition (from a set of inputs to a set of outputs, based on a set of states) in a finite, easily estimated amount of time [17];
- contains a minimal number of unique arithmetic expressions, tests and assignments originating from the source code (exact optimization with respect to the introduced overhead is a subject for future research).
The optimization algorithms are fairly expensive, in terms of worst-case complexity, for large specifications containing a large amount of concurrency. However, by using a design methodology that decomposes the complete specification into modules that are optimized independently and then composed asynchronously, one can trade off the amount of optimization against the execution time. An automatic exploration of the optimization space, in the vein of [15, 1], is left for future research.
References

[1] P. Ashar and S. Malik. Fast functional simulation using branching programs. In Proceedings of the International Conference on Computer-Aided Design, pages 408–412, November 1995.

[2] E. Badouel, L. Bernardinello, and Ph. Darondeau. Polynomial algorithms for the synthesis of bounded nets. In TAPSOFT '95: Theory and Practice of Software Development, 1995.

[3] G. Berry. The Constructive Semantics of Pure Esterel. To appear, 1996. Available at ftp://cma.cma.fr/esterel/constructiveness.1.0.ps.gz.

[4] G. Berry, P. Couronné, and G. Gonthier. The synchronous approach to reactive and real-time systems. Proceedings of the IEEE, 79, September 1991.

[5] G. Berry and H. Touati. Optimized controller synthesis using Esterel. In Proceedings of the International Workshop on Logic Synthesis, Tahoe City, California, May 1993.

[6] S. S. Bhattacharyya and E. A. Lee. Memory management for dataflow programming of multirate signal processing algorithms. IEEE Transactions on Signal Processing, 42(5), May 1994.

[7] M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, E. Sentovich, A. Sangiovanni-Vincentelli, and K. Suzuki. Synthesis of software programs for embedded control applications. In Proceedings of the Design Automation Conference, pages 587–592, June 1995.

[8] J. Cortadella, M. Kishinevsky, L. Lavagno, and A. Yakovlev. Synthesizing Petri nets from state-based models. In Proceedings of the International Conference on Computer-Aided Design, November 1995.

[9] M. Hack. Analysis of production schemata by Petri nets. Technical Report TR 94, Project MAC, MIT, 1972.

[10] N. Halbwachs. Synchronous Programming of Reactive Systems. Kluwer Academic Publishers, 1993.

[11] S. Malik. Analysis of cyclic combinational circuits. In Proceedings of the International Conference on Computer-Aided Design, pages 618–625, November 1993.

[12] T. Murata. Petri nets: properties, analysis and applications. Proceedings of the IEEE, pages 541–580, April 1989.

[13] M. Nielsen, G. Rozenberg, and P. S. Thiagarajan. Elementary transition systems. Theoretical Computer Science, 96:3–33, 1992.

[14] C. A. Petri. Kommunikation mit Automaten. PhD thesis, Institut für Instrumentelle Mathematik, Bonn, 1962. (Technical report, Schriften des IIM Nr. 3.)

[15] A. Sangiovanni-Vincentelli, P. McGeer, and A. Saldanha. Verification of electronic systems. In Proceedings of the Design Automation Conference, pages 106–111, June 1996.

[16] T. Shiple, G. Berry, and H. Touati. Constructive analysis of cyclic circuits. In Proceedings of the European Design & Test Conference, pages 328–333, March 1996.
[17] K. Suzuki and A. Sangiovanni-Vincentelli. Efficient software performance estimation methods for hardware/software codesign. In Proceedings of the Design Automation Conference, pages 605–610, June 1996.