Passive Code in Synchronous Programs

JENS BRANDT, KLAUS SCHNEIDER, and YU BAI, University of Kaiserslautern

The synchronous model of computation requires that in every step, inputs are read and outputs are synchronously computed as the reaction of the program. In addition, all internal variables are updated in parallel, even though not all of these values might be required for the current and future reaction steps. To avoid unnecessary computations, we present a compile-time optimization procedure that computes for every variable a condition that determines whether its value is required for current or future computations. In this sense, our optimizations allow us to identify passive code that can be disabled to avoid unnecessary computations and therefore to reduce the reaction time of programs or their energy consumption.

Categories and Subject Descriptors: D.3.2 [Programming Languages]: Language Classifications—Concurrent, distributed, and parallel languages; D.3.4 [Programming Languages]: Processors—Compilers, Code generation

General Terms: Languages, Performance, Measurement

Additional Key Words and Phrases: Synchronous languages, passive code, dataflow analysis, static analysis, optimization

ACM Reference Format: Jens Brandt, Klaus Schneider, and Yu Bai. 2014. Passive code in synchronous programs. ACM Trans. Embedd. Comput. Syst. 13, 2s, Article 67 (January 2014), 25 pages. DOI: http://dx.doi.org/10.1145/2544375.2544387

1. INTRODUCTION

Synchronous languages [Halbwachs 1993; Benveniste et al. 2003] have proved to be well suited for the design of safety-critical real-time embedded systems consisting of application-specific hardware and software. There are at least the following reasons for this success: First, it is possible to generate both efficient software and hardware from the same synchronous program. Since the synchronous model of computation is shared by synchronous hardware circuits, the translation to hardware is conceptually clearer than for classic hardware description languages like VHDL or Verilog. Second, it is possible to determine tight bounds on the reaction time by a simplified worst-case execution time analysis [Ju et al. 2010, 2009; Logothetis and Schneider 2003; Schuele and Schneider 2004], since, by construction of the programs, only a statically bounded finite number of actions can be executed within each reaction step. Third, the formal semantics of these languages allows one to prove (1) the correctness of the compilation and (2) the correctness of particular programs with respect


to given formal specifications [Schneider 2000, 2002; Schneider et al. 2006; Schneider and Brandt 2008].

1.1. The Synchronous Model of Computation

All these advantages are due to the synchronous model of computation, which postulates that a synchronous system is executed in discrete reaction steps, also called macro steps. In each reaction step, a synchronous program reads all inputs referring to the current clock and synchronously computes the values of all corresponding outputs as well as the values of the internal variables. All the computations done in a macro step are performed by a finite number of micro steps (e.g., simple assignments or other atomic actions). Since the micro steps executed in a macro step are all based on the same variable environment associated with the macro step, the values of variables used in a macro step must be computed before these variables are read by other micro-step actions of the same macro step. This leads to the programmer's view that the execution of micro-step actions does not take time at all, while every macro step of a synchronous program requires the same amount of logical time. As a consequence, concurrent threads run in lockstep and automatically synchronize at the end of their macro steps, so that even concurrent synchronous programs have deterministic behavior. As a result, deterministic single-threaded code can be obtained from multithreaded synchronous programs. Thus, software generated from synchronous programs can be executed on ordinary microcontrollers without the need for complex operating systems.

On the one hand, the synchronous model of computation simplifies programming, since developers do not have to care about low-level details like timing, synchronization, and scheduling. On the other hand, the synchronous paradigm has some consequences that make the compilation of synchronous programs not at all straightforward. As all signals in the program should have a well-defined temporal behavior, clock consistency is usually an issue: a clock calculus is used to infer the clock of every variable in the program as well as conditions to ensure a correct synchronization of concurrent computations. Causality analysis [Berry 1999; Schneider et al. 2004] is another challenge for compilers. Intuitively, causality cycles occur when the result of a micro-step action modifies the trigger condition that is responsible for the execution of that action. Causally correct programs therefore have a causal order of the micro-step actions in each macro step, which allows the program to compute values before they are read.

Research over the last two decades has tackled this problem and others, and various compilers have been developed [Berry and Gonthier 1992; Schneider et al. 2006; Brandt and Schneider 2011; Closse et al. 2002, 2001; Edwards 2002, 2004; Potop-Butucaru and de Simone 2003] based on different code-generation schemes like automaton-based code, (Boolean) equation systems, and concurrent control-dataflow graphs. While the first articles on the compilation of synchronous programs focused on the correctness of the compilers, considering mainly semantic issues like schizophrenia and causality problems, newer research efforts also consider the efficiency of the generated code: more and more compilers were designed with a target architecture in mind to provide a better basis for optimization.
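To make this lockstep execution scheme concrete, the following minimal sketch shows the driver loop that code generators typically wrap around a compiled reaction function; the function names are illustrative placeholders, not the Averest API:

```python
# Minimal sketch of the macro-step execution scheme; read_inputs, step,
# and write_outputs are hypothetical placeholders for generated code.
def run(read_inputs, step, write_outputs, state):
    while True:
        inputs = read_inputs()                # sample all inputs of this macro step
        outputs, state = step(inputs, state)  # finitely many micro steps,
                                              # logically zero time
        write_outputs(outputs)                # outputs of the same logical instant
```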

1.2. Contribution

This article aims at improving the runtime efficiency of synchronous programs. The code optimization considered in this article is not based on a particular code-generation scheme. Instead, it is based on the identification of passive code, that is, code for computing values of internal variables that are irrelevant for the values of the outputs


in the current or future macro steps. Our contribution consists of the definition and implementation of a static dataflow analysis that determines such passive code in synchronous systems, which is the key to various optimized code generators. Hence, we have neither a specific realization (hardware or software) nor a target architecture in mind, nor do we consider a particular source language. Instead, we aim at optimizing an intermediate code that is based on synchronous guarded actions (SGAs), a well-known and target-independent intermediate format for synchronous languages. This allows us to apply our analysis to different source and target languages and architectures, and makes the analysis independent of a particular code generator. We implemented all procedures in our Averest framework (http://www.averest.org).

Dataflow analysis has been a static analysis tool for code optimization in classical compiler design for decades. Early work [Allen 1970; Hecht and Ullman 1973, 1974; Rosen 1977] considers different kinds of dataflow analyses, like the computation of live variables, busy variables, available expressions (to detect shared expressions), use-def chains, and many more. However, these analyses cannot be directly applied to synchronous programs due to the different underlying model of computation and its different definition of equivalent computations. In particular, changes of the variables' values are synchronously performed at the level of macro steps. Hence, even though static dataflow analysis is a well-established tool for classic compiler optimization, it has not yet been used in compilers for synchronous languages. The static analysis described in Tardieu and de Simone [2003] computes conditions for the instantaneous execution of a synchronous program, which is done similarly by the control-flow predicates in Schneider [2001b]. In this article, we instead develop the first static dataflow analysis for synchronous programs based on fixpoint computations, similar to the classic dataflow analyses. However, our dataflow analysis takes care of the synchronous model of computation and aims at the identification of passive code, as previously explained. The results of this work can be used to optimize the code that is finally generated, for example, to reduce the reaction time of programs or to reduce their energy consumption, since fewer computations are performed in the macro steps.

This article extends previous publications, in particular, our paper [Brandt and Schneider 2009b], where the notion of passive code was introduced, and another paper [Bai et al. 2011], where classical optimizations from compiler design were adapted to synchronous programs. This article additionally gives a formalization of passive code and the analysis, so that the correctness of the presented optimizations can be verified. Furthermore, it presents a new semi-symbolic analysis, which integrates the computation of passive code with previous code-generation schemes and thereby significantly improves the usability of our approach.

1.3. Structure

The rest of this article is structured as follows: Section 2 first reviews some basic notions that are needed to define our optimization technique. Section 3 introduces the starting point of our dataflow analysis, synchronous guarded actions, which are used as a common intermediate format for the compilation. Section 4 presents the general symbolic approach of our dataflow analysis, which exactly computes for every internal variable a condition that holds if and only if the value of the internal variable is required for the current or later values of outputs. After discussing the pros and cons of this procedure, Section 5 presents a semi-symbolic version of the dataflow analysis,


which better suits practical requirements. Section 6 shows experimental results for our implementation before we conclude with a short summary in Section 7.

2. FIXPOINT EQUATION SYSTEMS

In this section, we review fixpoint equation systems and the vector μ-calculus, which will serve as the underlying formalism of our passive code analysis. This presentation is in the spirit of a classical paper by Steffen [1991] (see also Schmidt [1998]), where analogies between model checking and the dataflow analyses in code optimization were observed. This link has influenced research in the connected areas: while compiler design got a new formal underpinning in terms of the μ-calculus, model checking got access to well-established and efficient algorithms using worklists and strongly connected components [Tarjan 1972; Sharir 1981; Jiang 1993], which led to improved algorithms (e.g., [Cleaveland et al. 1993] or [Xie and Beerel 2000; Bloem et al. 2000]).

The μ-calculus is generally considered to be the underlying formalism for model checking [Emerson et al. 1993, 2001], since most temporal logics, in particular CTL* (and thus also CTL and LTL), can be translated to the μ-calculus [Dam 1994; Schneider 2003]. Basically, the μ-calculus is a temporal logic that can be used to describe properties of labeled transition systems. Its original idea is due to de Bakker [1976], while its current form was developed by Kozen [1983]. In the following, we only give a short overview, which is sufficient for the rest of the article. For more details, the interested reader is referred to Kozen [1983], Emerson et al. [2001], and Schneider [2003].

In general, a transition system is described by a Kripke structure K = (I, S, T, L) with a set of states S, a set of initial states I ⊆ S, a transition relation T ⊆ S × S, and a labeling function L, which maps each state s ∈ S to the set of variables that hold in s. Each formula describes a set of states of the transition system K: a propositional formula p is associated with the set of states that are labeled with p. The propositional operators are interpreted as set operations. Moreover, if φ is a formula of the μ-calculus, ◇φ represents its existential predecessor states pre∃K(φ), that is, the states having a successor in the set of states defined by φ. Similarly, □φ is the set of universal predecessor states pre∀K(φ), that is, the states whose successors are all in the set defined by φ. Analogously, ◇⁻φ represents the existential successor states suc∃K(φ), that is, the states having a predecessor in the set of states defined by φ, and □⁻φ is the set of universal successor states suc∀K(φ), that is, the states whose predecessors are all in the set defined by φ. Additionally, μx. Φ(x) is the least fixpoint of the function Φ, and νx. Φ(x) is its greatest fixpoint. Formally, we define the semantics of μ-calculus formulas as follows, where x is a variable, φ and ψ are μ-calculus formulas, and K[x↦Q] denotes the modification of the transition system K where exactly the states in Q are labeled by x:

⟦x⟧K = {s ∈ S | x ∈ L(s)}
⟦¬φ⟧K = S \ ⟦φ⟧K
⟦φ ∧ ψ⟧K = ⟦φ⟧K ∩ ⟦ψ⟧K
⟦φ ∨ ψ⟧K = ⟦φ⟧K ∪ ⟦ψ⟧K
⟦◇φ⟧K = pre∃K(⟦φ⟧K)        ⟦◇⁻φ⟧K = suc∃K(⟦φ⟧K)
⟦□φ⟧K = pre∀K(⟦φ⟧K)        ⟦□⁻φ⟧K = suc∀K(⟦φ⟧K)
⟦μx. φ⟧K = ⋂ {Q ⊆ S | ⟦φ⟧K[x↦Q] ⊆ Q}
⟦νx. φ⟧K = ⋃ {Q ⊆ S | Q ⊆ ⟦φ⟧K[x↦Q]}
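For a finite Kripke structure, these equations can be evaluated directly by set operations and Kleene iteration. The following sketch is illustrative (it is not the model checker used later); S is the state set and T a set of (s, t) transition pairs:

```python
# Explicit-state evaluation of the mu-calculus semantics on finite sets.
def pre_exists(T, phi):                 # states with some successor in phi
    return {s for (s, t) in T if t in phi}

def pre_forall(S, T, phi):              # states whose successors all lie in phi
    return {s for s in S if all(t in phi for (u, t) in T if u == s)}

def lfp(f):                             # mu x. f(x), iterating up from {}
    q = set()
    while f(q) != q:
        q = f(q)
    return q

def gfp(f, S):                          # nu x. f(x), iterating down from S
    q = set(S)
    while f(q) != q:
        q = f(q)
    return q

# Example: mu x. p \/ <>x, i.e., the states that can reach the set P:
# reach_p = lfp(lambda x: P | pre_exists(T, x))
```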

In the following, we use a more general variant of the traditional μ-calculus, which is commonly known as the vector μ-calculus [Cleaveland et al. 1993; Schneider 2003]. In contrast to the plain version, it describes mutually dependent fixpoints in the form


of equation systems, which may be exponentially more succinct than the unfolded μ-calculus formulas. In the vector μ-calculus, we can define several sets of states by a system of mutually dependent fixpoints:

x_1 =^σ1 Φ_1(x_1, . . . , x_n)
        ⋮                                  (1)
x_n =^σn Φ_n(x_1, . . . , x_n)

Thereby, σ_i is a placeholder for either μ or ν. The notation x_i =^σi Φ_i means that x_i should be the least (if σ_i = μ) or greatest fixpoint (if σ_i = ν) of the function that maps x_i to the right-hand side Φ_i. Full details on the semantics as well as on related model-checking algorithms can be found in Schneider [2003]. Since we formulate our passive code analysis by means of the vector μ-calculus, any model checker that supports this logic (e.g., the Averest framework) can be used for our analysis. Furthermore, the underlying theory of vector μ-calculus model checking also answers questions about the complexity of the computation [Schneider 2003]. Since we only use least fixpoints, that is, the alternation-free μ-calculus, the fixpoints can be computed in linear time with respect to the size of the transition system.
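Because all right-hand sides Φ_i are monotonic, an alternation-free system of least fixpoints can be solved by naive simultaneous iteration; the sketch below is illustrative (the worklist algorithms cited above achieve the linear bound):

```python
# Simultaneous iteration for x_i =mu Phi_i(x_1, ..., x_n); each phi takes
# the current vector of state sets and returns the new set for its x_i.
def solve_least(phis):
    xs = [set() for _ in phis]          # start below all least fixpoints
    changed = True
    while changed:
        changed = False
        for i, phi in enumerate(phis):
            new = phi(xs)
            if new != xs[i]:
                xs[i], changed = new, True
    return xs
```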

3. SYNCHRONOUS GUARDED ACTIONS

3.1. Syntax and Informal Semantics

The symbolic dataflow analysis that will be presented in the next section does not consider programs at the source-code level. Instead, it is based on our Averest Intermediate Format (AIF), which is used, as the name suggests, as the intermediate code representation in our Averest framework. As expected, it is possible to translate a synchronous program to a set of equivalent guarded actions. Within Averest, we have implemented this translation for the Esterel-like language Quartz and for the Lustre-like language Opal. We do not consider these translations to AIF in this article; the interested reader is referred to related publications [Potop-Butucaru et al. 2007; Halbwachs 1993; Raymond 1991; Brandt and Schneider 2011, 2009a; Synchron 1998].

AIF was designed in the spirit of guarded commands [Dijkstra 1975; Chandy and Misra 1989; Dill 1996; Järvinen and Kurki-Suonio 1990], which are a well-established formalism for the description of concurrent systems. However, in contrast to many other applications, the guarded actions considered here follow the synchronous model of computation. Hence, the starting point of our analysis is a system described by a set of SGAs of the form γ ⇒ A. The Boolean condition γ is called the guard and A is called the action. In this article, actions are restricted to the assignments of the source language, that is, the guarded actions have either the form γ ⇒ x = τ (for an immediate assignment) or γ ⇒ next(x) = τ (for a delayed assignment), while the guarded actions of our Averest framework may also contain assumptions and assertions. Both kinds of assignments evaluate the right-hand side expression τ in the current macro step. Immediate assignments x = τ transfer the obtained value of τ immediately to the left-hand side x, whereas delayed ones next(x) = τ transfer this value in the next step.

Guarded actions are generated for the dataflow and the control flow of the program. The dataflow consists of all assignments that occur in the program and determines the values of all declared variables. The control flow consists of actions of the form

γ ⇒ next(ℓ) = true, where γ is a condition that is responsible for moving the control flow at the next point of time to location ℓ.

The execution of the guarded actions obeys the synchronous model of computation: each variable has a unique value in each macro step. The set of possible values of a variable is determined by its type, and its current value in a macro step is determined


Fig. 1. Quartz program and its synchronous guarded actions.

either by the immediate assignments executed in the current macro step, the delayed assignments executed in the previous macro step, or the default reaction as explained below. In each macro step i, the system reads inputs and computes outputs, in principle, by evaluating all guarded actions. In general, a system will also have internal variables whose values are determined by guarded actions in the same way as for output variables, that is, either by immediate assignments enabled in the current macro step, by delayed actions enabled in the previous macro step, or by the default reaction if no action determines the value. Internal variables may stem from locally declared variables or from control-flow locations. Furthermore, internal variables are generated by the compiler in order to abbreviate common subexpressions and to avoid their recomputation.

Figure 1 shows a simple example of a Quartz program and its guarded actions as computed by our compiler. In each reaction step, the program reads the three Boolean inputs a, b, and r and generates a Boolean output o. There are three labels wa, wb, and wr, where the control flow may rest. In addition to the guarded actions, there are also definitions of internal variables t1, t2, t3, and t4 that have been generated by the compiler to share common subexpressions. For each of the writable variables o, wa, wb, and wr, there is one guarded action. The one for o is the only guarded action of the dataflow, and the remaining three guarded actions determine the control flow of the program. Finally, the additional variable start is used to start the overall computation of the program. This variable is often called the boot location of the program.

In addition to the set of guarded actions, there are implicit assignments due to the semantics of the program: the so-called default reaction defines the value of a variable if no action determines its value in the current macro step. For a variable y, this is the case if and only if the guards of all immediate assignments to y are false in the current step and the guards of all delayed assignments to y were false in the preceding step. In this case, the default reaction determines the value of the variable, which depends on the storage mode of the variable. There are two storage modes, namely event and memorized, which are part of the declaration of the variable (see Figure 1): event variables EventVars(V) ⊆ V are reset to their default values (like wires in hardware circuits) by the default reaction, while memorized variables MemVars(V) ⊆ V store their previous values (like registers in hardware circuits). In principle, the default reactions can also be expressed explicitly as guarded actions (at the price of additional variables). However, for practical reasons (in particular, to support the modularity of the intermediate code [Brandt and Schneider 2011, 2009a]), this information is kept separately from the other guarded actions. For example, the output variable o used in the program shown in Figure 1 is an event variable (indicated by the event keyword in its declaration). The storage mode of input variables is irrelevant, and the location variables, like wa, wb, and wr, in the program of Figure 1 are always event variables.
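To summarize the execution model, the following sketch (an illustrative encoding, not the AIF data structures) shows guarded actions and one macro step including the default reaction; causality is assumed to be resolved, that is, the actions are given in a causal order:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GuardedAction:
    guard: Callable[[dict], bool]    # gamma, evaluated in the current step
    lhs: str                         # written variable x
    rhs: Callable[[dict], object]    # tau, also evaluated in the current step
    delayed: bool                    # False: x = tau; True: next(x) = tau

def macro_step(actions, inputs, prev_env, carried, event_vars, defaults):
    """carried holds the values written by delayed assignments of the
    previous macro step; actions are assumed to be in causal order."""
    env = dict(inputs)
    env.update(carried)                        # delayed writes take effect now
    delayed_next = {}
    for act in actions:
        if act.guard(env):                     # action enabled in this step
            if act.delayed:
                delayed_next[act.lhs] = act.rhs(env)   # visible next step
            else:
                env[act.lhs] = act.rhs(env)            # visible immediately
    for x, dflt in defaults.items():           # default reaction
        if x not in env:
            env[x] = dflt if x in event_vars else prev_env.get(x, dflt)
    return env, delayed_next
```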


Fig. 2. The initial condition and transition relations of a variable x ∈ V_L ∪ V_O; Default(x) is a default value determined by the type of x.

3.2. Formal Semantics

In the following, we use a semantics based on transition systems to describe the behavior of a synchronous system, as proposed in Schneider [2001a]. Such a system consists of a set of SGAs G and a set of variables V, which is the disjoint union of the input variables V_I, the local/internal variables V_L, and the output variables V_O occurring in G.

Definition 3.1 (Synchronous System Semantics). The semantics of a synchronous system (V, G) defined over the variables V = V_I ∪ V_L ∪ V_O is given by a transition system K = (S, I, T) with states S, initial states I ⊆ S, and transition relation T ⊆ S × S (in contrast to the Kripke structures of Section 2, there is no label function L, since states directly map variables to values). Each state s is a mapping from variables to values, that is, s maps each variable x ∈ V to a value s(x) of its domain. Assume that G contains for the variable x the guarded actions (γi, x = τi) for i ∈ {1, . . . , p} and (χi, next(x) = πi) for i ∈ {1, . . . , q}; then Figure 2 shows how the initial condition Init(x) and the transition condition Trans(x) are defined for variable x. The initial condition I and the transition relation T of K are then defined as I := ⋀_{x ∈ V_L ∪ V_O} Init(x) and T := ⋀_{x ∈ V_L ∪ V_O} Trans(x).

The initial condition Init(x) simply states that whenever a trigger condition γi holds, then the corresponding equation x = τi must also hold. If no trigger condition γi holds, then x equals its default value Default(x) (determined by its type). The transition relation Trans(x) also demands that the equation x = τi must hold if the trigger condition γi holds. Moreover, if the trigger condition χi of a delayed assignment next(x) = πi holds now, then x must have the value of πi at the next point of time (note that πi is evaluated at the current point of time). If neither an immediate nor a delayed assignment is enabled, then the value is specified by the reaction to absence Absence(x), which differs for event and memorized variables.

This transition system is given at the level of macro steps, that is, each transition corresponds to a macro step of the synchronous program. One can also define a transition system at the level of micro steps, which is required to reason about semantic problems such as causality (see Section 3.3 or [Berry 1999; Schneider 2009]). Since we assume that the given set of SGAs is causally correct, it is sufficient to consider the macro-step transitions for our analysis.

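Figure 2 itself is not reproduced in this text-only version; following the prose above, the two conditions have roughly the following shape (a hedged reconstruction, where primed variables refer to the successor state):

```latex
\mathrm{Init}(x) :=
  \bigwedge_{i=1}^{p}\bigl(\gamma_i \rightarrow x = \tau_i\bigr)
  \;\wedge\;
  \Bigl(\bigwedge_{i=1}^{p}\neg\gamma_i \rightarrow x = \mathrm{Default}(x)\Bigr)

\mathrm{Trans}(x) :=
  \bigwedge_{i=1}^{p}\bigl(\gamma_i' \rightarrow x' = \tau_i'\bigr)
  \;\wedge\;
  \bigwedge_{i=1}^{q}\bigl(\chi_i \rightarrow x' = \pi_i\bigr)
  \;\wedge\;
  \Bigl(\bigwedge_{i=1}^{p}\neg\gamma_i' \wedge \bigwedge_{i=1}^{q}\neg\chi_i
        \rightarrow \mathrm{Absence}(x)\Bigr)
```

where Absence(x) is x′ = Default(x) for event variables and x′ = x for memorized variables.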


Fig. 3. Opal program, its guarded actions, and its action dependency graph.

3.3. Causal Ordering of Actions

According to the synchronous model of computation, the guards of all guarded actions are simultaneously checked in each macro step. If a guard is true, then the action is enabled and immediately executed. The immediate update of variables due to immediate assignments raises the question of the causality of enabled actions: actions must not enable themselves, to avoid nondeterminism or even loss of reactivity. Causality is a well-studied problem in the context of synchronous languages, and different synchronous languages deal with this problem in their own way. For example, Lustre [Halbwachs 1993] and Opal basically require that all immediate dependencies are acyclic, which can be checked very efficiently. Esterel and Quartz have a more general definition based on the notion of constructiveness [Berry 1999; Schneider et al. 2004], which allows for cyclic dependencies as long as they can be constructively resolved. However, the algorithms for analyzing the causality of such programs are more expensive.

To illustrate the problem, consider a simple system consisting of the single guarded action ¬x ⇒ x = true with an event variable x. Since the assignment to x is executed at the same point of time when x is checked, this system has no consistent behavior. If we consider, instead, a system with the SGA x ⇒ x = true, then we see that this program is nondeterministic, since it has two consistent behaviors. In general, nonconstructive programs like these examples are not desired, and compilers for synchronous languages have to reject them.

The guarded actions must be executed in a causal order, that is, according to their data dependencies. These data dependencies can be analyzed by a so-called action dependency graph (ADG). ADGs are bipartite graphs whose vertices either represent variables (solid circles for memorized variables, dashed circles for event variables) or guarded actions (rectangles). The edges show the dependencies between these actions and variables: the edge (γ ⇒ A, x) expresses that x is written by action A, and the edge (x, γ ⇒ A) expresses that x is read by the trigger γ or the action A. For the write edges, we distinguish immediate (solid edges) and delayed (dashed edges) writes. Thus, the graph exactly encodes the restrictions for the execution of the guarded actions of a synchronous system. An action can only be executed if all variables occurring in its guard and in the right-hand side of its assignment (i.e., its read variables) are known. A variable is thereby only known if all actions writing it have been evaluated before.

Consider the Opal program given on the left-hand side of Figure 3. Our compiler computes the guarded actions shown below the program, which can be arranged in


the action dependency graph shown on the right-hand side of the figure. The ADG visualizes the data dependencies; for example, clk_w must be computed before w, w before clk_y, and clk_y before y.

The read and write dependencies between guarded actions form the basis of our dataflow analysis. We therefore start with their definition.

Definition 3.2 (Read and Write Dependencies). Let FV(τ) denote the free variables occurring in the expression τ. Then, the following definitions determine the dependencies from actions to variables:

grdVars(γ ⇒ x = τ) := FV(γ)          grdVars(γ ⇒ next(x) = τ) := FV(γ)
rhsVars(γ ⇒ x = τ) := FV(τ)          rhsVars(γ ⇒ next(x) = τ) := FV(τ)
rdVars(γ ⇒ x = τ) := FV(γ) ∪ FV(τ)   rdVars(γ ⇒ next(x) = τ) := FV(γ) ∪ FV(τ)
wrVars(γ ⇒ x = τ) := {x}             wrVars(γ ⇒ next(x) = τ) := {next(x)}

The dependencies from variables to actions are determined as follows:

grdActs(x) := {γ ⇒ A | x ∈ grdVars(γ ⇒ A)}
rhsActs(x) := {γ ⇒ A | x ∈ rhsVars(γ ⇒ A)}
rdActs(x) := {γ ⇒ A | x ∈ rdVars(γ ⇒ A)}
wrActs(x) := {γ ⇒ A | x ∈ wrVars(γ ⇒ A)}
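These functions are straightforward to implement; the following sketch (with an illustrative tuple encoding of expressions and actions, not the AIF representation) mirrors Definition 3.2 and derives the ADG edges from it:

```python
# Expressions: variables are strings, operators are tuples like
# ("and", "a", ("not", "b")); constants are encoded as ("true",) etc.
# Actions are tuples (guard, lhs, rhs, delayed).
def free_vars(expr):
    if isinstance(expr, str):
        return {expr}                    # a variable leaf
    if isinstance(expr, tuple):
        _op, *args = expr
        return set().union(set(), *(free_vars(a) for a in args))
    return set()                         # other constants

def grd_vars(act): return free_vars(act[0])
def rhs_vars(act): return free_vars(act[2])
def rd_vars(act):  return grd_vars(act) | rhs_vars(act)
def wr_vars(act):  return {("next", act[1]) if act[3] else act[1]}

def adg_edges(actions):                  # bipartite action dependency graph
    edges = []
    for act in actions:
        edges += [(act, x) for x in wr_vars(act)]   # write edges
        edges += [(x, act) for x in rd_vars(act)]   # read edges
    return edges
```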

Hence, grdVars(α) determines the variables occurring in the guard γ, and rhsVars(α) determines the variables occurring in the right-hand side expression τ of the assignment. rdVars(α) determines all variables that are read by the guarded action, and wrVars(α) is the set of variables that are modified if α fires. In the opposite direction, all guarded actions that read variable x in their guard, read x in their right-hand side, read x anywhere, or write x can be determined by grdActs(x), rhsActs(x), rdActs(x), and wrActs(x), respectively. Note that wrActs(x) only considers immediate assignments to the variable x, while delayed ones are referenced by wrActs(next(x)).

4. SYMBOLIC DATAFLOW ANALYSIS

4.1. Passive Code

As outlined in Section 3, a synchronous program can, in general, not be executed like usual sequential programs due to the assumed synchrony: all enabled actions must be executed concurrently according to their data dependencies. Clearly, only a subset of the guarded actions is enabled within a macro step, and identifying them efficiently can significantly improve the performance, which was one of the motivations for code-generation schemes based on the discrete-event model [Closse et al. 2002] or concurrent control-/dataflow graphs [Zeng et al. 2004; Potop-Butucaru and de Simone 2003]. Obviously, guarded actions that are not enabled in the current macro step (their guards evaluate to false) do not contribute to the final result.

In addition to disabled guarded actions, there is still more room for optimization: perfect synchrony demands that unique values have to be determined for all variables (including all local variables) in each step. However, some local values may not be needed to compute the final outputs. The key to a simple code optimization is therefore to determine actions that do not contribute to the final result, which we call passive code in the following. The notion of passive code is also known from sequential programs, for example, for code that is not required to compute the return value of a function. Note that dead code is different from passive code, since dead code is not reachable, that is, not executed in any run of the program. A detailed comparison is given in Section 5.2.

The identification of conditions that hold if and only if code is passive is the problem we want to tackle in this section. To this end, we introduce for each variable x a new variable req_x, which should hold if and only if the value of x is required for


Fig. 4. Relaxed transition system for synchronous guarded actions.

the computations of the current or future macro steps. These expressions are called required conditions. The required conditions obviously depend on the externally visible behavior of the system. Usually, all outputs of the system should be computed in all steps. Hence, we set req_x = true for all output variables x ∈ V_O (these definitions of the required variables of the outputs can be weakened if the system is accompanied by a specification that implies that not all outputs are needed in all steps). In contrast, all other variables of the system x ∈ V \ V_O are only of interest if they carry values that are used now or later on to compute the outputs. Hence, their required conditions should be as strong as possible for optimization. The required conditions induce a relaxed transition system that allows arbitrary values for variables that are not required in a step. Formally, this system is defined as follows.

Definition 4.1 (Relaxation). Let K = (S, I, T) be a transition system over the variables V = V_I ∪ V_L ∪ V_O and the guarded actions G, as defined in Definition 3.1, and let R = {req_x | x ∈ V} be a set of required variables. The relaxed transition system of K with respect to R is K_R = (S, I, T′) with T′ = ⋀_{x ∈ V_L ∪ V_O} Trans′(x), where Trans′(x) is defined in Figure 4.
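Figure 4 is likewise not reproduced here. Consistent with the prose and with the proof of Theorem 4.4 below (where Trans(x) and Trans′(x) coincide whenever req_x holds), one plausible shape of the relaxed relation is simply

```latex
\mathrm{Trans}'(x) \;:=\; \mathit{req}_x' \;\rightarrow\; \mathrm{Trans}(x)
```

so that the value of x is left unconstrained in successor states where it is not required; the actual Figure 4 refines this per kind of assignment.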

We are interested in relaxed transition systems that preserve the behavior of the original system, that is, where for the same inputs, the outputs of the original and the relaxed version of the transition system coincide.

Definition 4.2 (Correctness). Let K = (S, I, T) be a transition system and K′ = (S′, I′, T′) its relaxation. The relaxation K′ is said to be correct if there is a bisimulation between both transition systems, that is, a relation ∼ ⊆ S × S′ with the following properties.

—(sim0). s ∼ s′ implies ∀x ∈ V_I ∪ V_O. s(x) = s′(x).
—(sim1). For each s ∈ I, there is an s′ ∈ I′ with s ∼ s′.
—(sim2). For each s′ ∈ I′, there is an s ∈ I with s ∼ s′.
—(sim3). For (s1, s2) ∈ T and s1 ∼ s1′, there is an s2′ with (s1′, s2′) ∈ T′ and s2 ∼ s2′.
—(sim4). For (s1′, s2′) ∈ T′ and s1 ∼ s1′, there is an s2 with (s1, s2) ∈ T and s2 ∼ s2′.

The existence of a bisimulation relation between the two transition systems ensures that both have the same observable behavior. At the end of Section 3.2, we assumed that the original system is logically correct. If this assumption were removed, property (sim4) would have to be modified to reflect the additional paths in the relaxed transition system.

4.2. Computing Passive Code

The main task of our dataflow analysis is to compute program expressions to define the required variables req_x for local variables x so that the observable behavior of the original system is preserved.


Fig. 5. Auxiliary definitions for symbolic analysis.

To this end, our main idea is to follow the dependencies encoded in the ADG in reverse direction and to construct the conditions for req_x by a simple graph traversal over the ADG. Unfortunately, this naive computation is not possible, since the conditions req_x mutually depend on each other.

First, there are data dependencies across macro steps that result from delayed actions γ ⇒ next(x) = τ. Hence, if such a variable x is required in step i, all variables in γ are required in the preceding step i − 1, and all variables in τ are required in step i − 1 as well if γ holds there. As an example, consider wa in Figure 1: it is set by a delayed action. This is always the case for control-flow variables of synchronous programs, since they define where the system should resume its execution in the next control state. Second, there are data dependencies across macro steps that result from the default reaction: in particular, the value of a local memorized variable might be computed several steps before it is actually used in the computation of an output. Since these dependencies are not encoded in the ADG, the dataflow analysis must explicitly handle these storage properties. Third, the ADG is, in general, cyclic, even for programs that are commonly referred to as acyclic in the synchronous languages community. For example, variable wa in Figure 1 occurs on the left-hand side and on the right-hand side of the same action, which leads to a pseudocycle in the dependency graph. For the execution of the synchronous program, this cycle does not cause any problems, since the new value is fed back only in the following macro step, but the definition of its required condition nevertheless becomes recursive.

In order to determine req_x, we therefore make use of the vector μ-calculus. If the dependency crosses a macro step, we use the diamond operator of the μ-calculus: ◇req_x states that there is a successor state where x is required. To cope with the cyclic dependencies, we use the fixpoint operator μx. f(x), which returns the least fixpoint of the given function f. This is exactly the operation that we need in our computation, since we want to minimize the number of states where the variables req_x hold. Before we define the mutually recursive equations for the req_x conditions, we introduce some auxiliary definitions, which are given in Figure 5.

—DefByNow(x). A variable x is defined now if an immediate assignment to x is enabled.
—DefByNxt(x). A variable x is defined next if a delayed assignment to x is enabled.
—UsedByNow(x). The value of x is used by a required immediate assignment if there is a guarded action γ ⇒ y = τ, where the assigned variable y is required and x occurs in either γ or τ, where occurrences in τ only matter if γ holds.
—UsedByNxt(x). x is used by a required delayed assignment if there is a guarded action γ ⇒ next(y) = τ, where the assigned variable y is required in a successor state and x occurs in either γ or τ, where occurrences in τ only matter if γ holds.
—UsedByAbs(x). Similar to UsedByNxt(x), but addresses the implicitly given reaction to absence. If currently no delayed assignment to x is enabled and there is a successor


state where also no immediate assignment to x is enabled, but x (hence, its current value) is required there, then req_x must hold now.

Whenever one of UsedByNow(x), UsedByNxt(x), and UsedByAbs(x) holds, x is required, and therefore some guarded action on x (including its reaction to absence) has to determine its value at that point of time.

Definition 4.3 (Computing Required Conditions). The following system of mutually dependent fixpoint equations describes the desired req_x conditions for a set of input/local variables x0, . . . , xn and a set of output variables y0, . . . , ym:

req_y0 =^μ true
    ⋮
req_ym =^μ true
req_x0 =^μ UsedByNow(x0) ∨ UsedByNxt(x0) ∨ UsedByAbs(x0)
    ⋮
req_xn =^μ UsedByNow(xn) ∨ UsedByNxt(xn) ∨ UsedByAbs(xn)

As can be seen, there are three obvious reasons why a variable xi may be required: it may be currently used by a required immediate/delayed assignment, or it is a memorized variable that has to be stored, since its value is used later by required computations. From the underlying theory of the μ-calculus, we can derive that the given equation system has a uniquely determined least fixpoint, since all functions are continuous by construction. The monotonicity can easily be seen, since all occurrences of the fixpoint variables req_xi are positive. Continuity immediately follows from monotonicity and the finite domain of all variables (or the continuity of the operators in the case of infinite data types). Provided that the synchronous system has been translated to a transition system, as given by Definition 3.1, any model checker that supports the vector μ-calculus can be used for the computation of the req_x conditions.

To illustrate Definition 4.3, consider the example in Figure 6. The compiler extracts from the source program in the upper-left corner the guarded actions shown in the dependency graph on the right-hand side. Then, the guarded actions are used to determine the table in the lower part of the figure, which contains the values of the auxiliary functions of Figure 5. We can also easily determine that the UsedByAbs condition is false except for variable y, where we do not need it. Finally, these values are combined in the equation system defining the fixpoint to compute the solution of the required variables shown on the left-hand side, below the program.

As explained in Section 4.1, the resulting required conditions induce a relaxed transition system. In order to use it for an optimized implementation of the original system, it should have the same observable behavior. In the following theorem, we prove the soundness of our approach, that is, that Definition 4.3 leads to such a system.

THEOREM 4.4 (CORRECTNESS OF REQUIREDNESS CONDITIONS). The relaxed transition system based on Definition 4.3 preserves the observable behavior of the original system.

PROOF. We consider the transition systems K = (S, I, T) and K′ = (S′, I′, T′) according to Definitions 3.1 and 4.1, where the latter makes use of the req_x conditions determined by Definition 4.3. Between states s ∈ S and s′ ∈ S′, we define the relation s ∼ s′ :⇔ ∀x ∈ V. (x ∈ V_I ∨ s′(req_x)) → s(x) = s′(x) and have to prove that this is a bisimulation relation. Condition sim0 is true by definition of ∼ (since all outputs are required). Moreover, sim1 and sim3 directly follow, since S ⊆ S′, I ⊆ I′, and T ⊆ T′ hold by Definitions 3.1


Fig. 6. Source program, action dependency graph, auxiliary functions, and equation system for requirement conditions.

and 4.1. Condition sim2 is also true by construction, since we use the same initial condition in K and K′. Thus, it remains to prove sim4.

To prove sim4, we have to show that for all (s1′, s2′) ∈ T′ with s1 ∼ s1′, there is an s2 ∈ S such that (s1, s2) ∈ T and s2 ∼ s2′ hold. We choose the uniquely determined state s2 ∈ S that has a transition from s1 and has the same inputs as s2′ (it is uniquely determined since we assume causal, and thus deterministic, systems K). It remains to prove that s2 ∼ s2′ holds. By construction of s2, we already have s2(x) = s2′(x) for all x ∈ V_I, so it remains to prove s2′(req_x) → s2(x) = s2′(x) for all x ∈ V_L ∪ V_O. This is true, since whenever s2′(req_x) holds, then Trans(x) and Trans′(x) become the same expressions, and by definition of req_x, all variables that occur in the enabled actions of Trans(x) and Trans′(x) are also required (this can be proved by induction on the rank of the variables, that is, the index where the variable enters the sets in the fixpoint iterations). Hence, all required variables in s2 are uniquely determined as those in s2′.
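On finite-state systems, Definition 4.3 can also be evaluated without a full μ-calculus model checker by representing each req_x as an explicit set of states; a minimal sketch, where the used_by_* callbacks are assumed to evaluate the conditions of Figure 5 on such sets:

```python
# Least fixpoint of the req equations by Kleene iteration over state sets.
def compute_req(states, outputs, internals,
                used_by_now, used_by_nxt, used_by_abs):
    req = {x: set() for x in internals}
    req.update({y: set(states) for y in outputs})   # outputs: always required
    changed = True
    while changed:
        changed = False
        for x in internals:
            new = (used_by_now(x, req) | used_by_nxt(x, req)
                   | used_by_abs(x, req))
            if new != req[x]:
                req[x], changed = new, True
    return req
```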


In principle, we could also relax the initial condition of K′. However, since the initial state has almost no influence on the optimization, we decided to keep the initial conditions.

Apparently, our approach defines the required conditions at the level of actions. It might also be possible to implement a finer analysis on the basis of subexpressions, which takes lazy evaluation into account. However, we decided against it, since we do not want to depend on certain target platforms. Furthermore, we do not fix an order of subexpression evaluation, and thereby, we also circumvent another problem: due to the symmetry of some operators with respect to lazy evaluation, a straightforward computation of the required conditions is not possible anymore, since there is no weakest set of required conditions for a given system. Consider, for example, the guarded action true ⇒ a = b ∧ c. Apparently, both req_b = c, req_c = true and req_b = true, req_c = b are valid, but not req_b = c, req_c = b. The symmetry has to be broken one way or the other; for example, the action can be expanded into the two actions φ ∧ ¬b ⇒ a = false and φ ∧ b ⇒ a = c, where b is evaluated first and c only in the case that its value affects the overall result.

4.3. Conservative Approximations

As is generally the case for symbolic model-checking procedures, the computational effort can become prohibitively high for the approach presented in the previous section. In these cases, conservative approximations of the conditions can be used, which allow more reads and writes than the optimal system. A reasonable approximation is to avoid the computation of the reachable states and to determine the required conditions of the fixpoint equation system only by means of rewriting and some simple reduction rules that eliminate the modal operators. Consider the example in Figure 6.

(1) Replacing req_y by true on all right-hand sides immediately yields req_x1 = w1, req_x2 = start ∨ w2, req_w1 = true, and req_w2 = true.
(2) req_w3 is a bit harder to estimate: simple rewriting yields req_w3 = ◇req_w3 ∨ w1 ∨ start ∨ w2, which still contains modal operators. However, we can always safely replace a req_x variable by true, and thus, we may simply estimate req_w3 = true.
(3) Finally, rewriting of req_i1 and req_i2 with the results obtained so far yields req_i1 = req_i2 = (start ∨ w3) ∧ (w1 ∨ (start ∨ w2)). Since w1 has only the guarded action start ∨ w2 ⇒ next(w1) = true, we can replace w1 by start ∨ w2. Thus, we have req_i1 = req_i2 = start ∨ (w2 ∧ w3). Since moreover w2 → w3 holds, we have req_i1 = req_i2 = start ∨ w2.

Thus, the fixpoint of the example in Figure 6 can be approximated as follows without using fixpoint iterations (this is even the precise result).

req_y  =^μ true            req_w2 =^μ true
req_x1 =^μ w1              req_w3 =^μ true
req_x2 =^μ start ∨ w2      req_i1 =^μ start ∨ w2
req_w1 =^μ true            req_i2 =^μ start ∨ w2
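The rewriting strategy used in this example can be mechanized; the sketch below (with an illustrative tuple encoding of expressions) applies the "when in doubt, true" rule that underlies the approximation:

```python
# Expressions as nested tuples; ("req", "x") refers to a required condition
# and ("dia", e) stands for the diamond operator (illustrative encoding).
def approx(expr, known):
    if isinstance(expr, (bool, str)):
        return expr                         # constant or program variable
    op, *args = expr
    if op == "req":
        return known.get(args[0], True)     # unresolved req_x -> true (safe)
    if op == "dia":
        return True                         # eliminate modal operators safely
    if op == "or":
        return simplify_or([approx(a, known) for a in args])
    if op == "and":
        return simplify_and([approx(a, known) for a in args])
    return expr

def simplify_or(args):
    if True in args:
        return True
    rest = [a for a in args if a is not False]
    return ("or", *rest) if len(rest) > 1 else (rest[0] if rest else False)

def simplify_and(args):
    if False in args:
        return False
    rest = [a for a in args if a is not True]
    return ("and", *rest) if len(rest) > 1 else (rest[0] if rest else True)
```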

Note that approximations do not harm the correctness, but only the degree of optimization that can be achieved. In the example, we now see that the inputs i1 and i2 are only required at the even points of time, and that x2 is also only needed there. Note that the original program always computed x2.

4.4. Optimizing the Intermediate Code

The required conditions and their approximations can be used in the final code-generation step, which transforms the intermediate code to the target code. Our

Passive Code in Synchronous Programs

67:15

definitions are the basis for the removal of passive code, which improves the runtime of the software or the energy efficiency of the hardware circuit. The most obvious optimization is to eliminate passive code by using the required conditions to strengthen the guarded actions. Any immediate action γ ⇒ x = τ can be replaced by γ ∧ req x ⇒ x = τ , and any delayed assignment γ ⇒ next (x) = τ can be replaced by γ ∧ req x ⇒ next (x) = τ (to this end, req x has to be replaced by an equivalent program expression without modal operator). This corresponds to operand isolation and clock gating in hardware circuits. In software, this may speed up the execution time of the system, but (similar to clock gating), it may also have negative effects due to the additional effort for evaluation of required conditions at runtime: one has to carefully balance this additional effort against the savings. If the evaluation of guards becomes too complex, one can weaken them by allowing the action to fire even if a variable is not needed. Any guard γ that is implied by γ ∧ req x and that still implies γ leads to a correct behavior for immediate assignments, and analogously, any guard γ that is implied by γ ∧ req x and that still implies γ leads to a correct behavior for delayed assignments. If the target is a hardware implementation, strengthening the guards in general improves the energy efficiency of the hardware circuit, since fewer gates of the circuit operate per cycle. Additionally, the elimination of passive code can be used to optimize the resource requirements of a given system. In general, a synchronous system needs dedicated resources for all actions that can execute within the same macro step, for example, if three guarded actions can be potentially run in parallel, and each of them needs a multiplier, three multipliers must be statically allocated. However, if the strengthened guards remove potential conflicts, then resources can be safely shared. In particular, just as in traditional liveness analysis, the condition req x can be exploited to determine an optimal register allocation for the variables. Two variables x1 and x2 can be mapped to the same register if they are never required at the same time, which can be simply checked by ¬(req x1 ∧ req x2 ). Apparently, the formula covers the special case of mapping two variables of distinct scopes to the same register. This optimization is useful for both hardware and software realizations. 5. SEMI-SYMBOLIC DATAFLOW ANALYSIS

In this section, we refine the dataflow analysis of the previous version to one based on extended finite-state machines (EFSMs). This variant approximates the required conditions in a conservative manner and is much faster in practice, while being exponential in theory (i.e., in the worst case). Hence, compilers should rather use the analysis presented in the following than the pure theory of the previous section. In the following, Section 5.1 first explains how EFSMs are obtained from SGAs, while Section 5.2 shows how the previous dataflow analysis can be ported to this representation. 5.1. Translation to Extended Finite-State Machines

As already explained, the approach presented in the previous sections has two significant drawbacks: first, the runtime of the analysis, as explained in Section 4.2, may exceed an acceptable amount for realistic examples, while the conservative approximations given in Section 4.3 may be too coarse. Second, even after a successful computation of the required conditions, an aggressive strategy which adds these conditions to all guards could worsen the performance of the system significantly, and the trade-off between the savings and the additional effort of strengthened guards is in general very difficult to analyze in advance. Therefore, this section introduces an alternative approach that is a conservative approximation of the dataflow analysis of Section 4. It exploits structural information of the synchronous systems by first generating an extended finite-state machine (EFSM) ACM Transactions on Embedded Computing Systems, Vol. 13, No. 2s, Article 67, Publication date: January 2014.

67:16

J. Brandt et al.

Fig. 7. Quartz program, guarded actions, and generated extended finite-state machine.

from the SGAs of the synchronous program. EFSMs explicitly represent the state space of the control flow: each state s represents a subset Labels(s) ⊆ L of the control-flow labels, and edges between states are labeled with conditions that must be fulfilled to move from the source state to the target state. Thus, EFSMs are representations where the control flow of a program state is explicitly represented, while the dataflow is still represented symbolically (while SGAs represent both control- and dataflow symbolically). Note that in order to obtain efficient code, compilers also determine a control-flow state for dataflow languages (i.e., Lustre or Opal) by enumerating a subset of the Boolean variables [Halbwachs et al. 1991]. The guards of the actions of the control flow are therefore translated to transition conditions of the EFSM state transitions. The guarded actions of the dataflow are first copied to each state of the EFSM and then partially evaluated according to the values of the control-flow labels in that EFSM state. Hence, in each macro step, the generated code will only consider a subset of the guarded actions, which in general speeds up the execution (since most of them are only active in a few states). Definition 5.1 (Extended Finite-State Machine). An Extended Finite-State Machine (EFSM) is a tuple (S, s0 , T , D), where S is a finite set of states, s0 ∈ S is the initial state, and T ⊆ (S ∗ C ∗ S) is a finite transition relation, where C is the set of transition conditions. D is a mapping S → D, which assigns each state s ∈ S a set of dataflow guarded actions D(s) ⊆ D which are executed in state s. An EFSM is usually illustrated by a directed graph, whose vertices are the states and whose edges are the transitions labeled with the transition conditions. For example, consider the graph given on the right-hand side of Figure 7. In addition to the initial state, two states (combinations of labels) are reachable: s1 := {w1 , w2 } and s2 := {w1 , w3 }. As the control flow is explicitly enumerated, EFSMs may suffer from a state-space explosion, since n control-flow locations may result into 2n EFSM states (all locations are reachable). It is not only the number of (control-flow) states that poses problems; also the dataflow-guarded actions must be replicated in many states. However, for many practical examples, the EFSM size does not show this worst-case behavior. ACM Transactions on Embedded Computing Systems, Vol. 13, No. 2s, Article 67, Publication date: January 2014.

Passive Code in Synchronous Programs

67:17

Furthermore, by converting control-flow labels to simple Boolean variables, a model of arbitrary control-flow size—between fully symbolic (guarded actions) and fully explicit (EFSM)—can be considered so that the code size can always be kept manageable. It is important to see the difference between an EFSM and control-flow graphs used in classic compiler design, which also has implications on the optimizations presented in the following section. While ‘states’ of classic control-flow graphs consist of assignments that are sequentially executed, states of the EFSM contain guarded actions that are concurrently executed within one macro step. Moreover, transitions in the EFSM terminate a macro step of the synchronous model so that new values of the input variables are read on the transition. Due to these differences, many transformations made in classic code optimization cannot directly be applied on EFSMs for code generation of synchronous programs. Note that the compilation to a EFSM merged all immediate transitions of the original control flow into a single state. According to the definition previously given, an EFSM state specifies which program locations are active in a step. Thus, a state can be seen as a particular assignment to the location variables L. If a location variable ∈ L is assigned true, it means that the control flow of the program is at in this EFSM state. We compute the EFSM by a symbolic simulation of the program [Halbwachs et al. 1991] which basically works as follows: we explore the reachable control-flow states in a depth-first traversal. For the initial state, we assume that all location variables are set to false. To compute the successor states, this assignment is used for a partial evaluation of the guards of all control-flow guarded actions γ ⇒ next ( ) = true , which determine the location variables in the next macro step. If γ is evaluated to false, we know that is assigned false at the next macro step. Otherwise, can be active at the next macro step. After all successor states of one state have been generated, we apply the same procedure to the new successor states. This traversal of the control-flow states finally creates all reachable EFSM states. Termination is guaranteed because the number of location variables is finite. After the generation of the states and transition relations, the dataflow-guarded actions (DGAs) are distributed to the states. The assignment of the location variables in each state is used to evaluate the truth value of the guards of all DGAs. If the guard is evaluated to false, the DGA is omitted in the state. Otherwise, we add the DGA to D(s). As an example, Figure 7 shows a Quartz program and the generated EFSM, where s1 := {w1 , w2 } and s2 := {w1 , w3 }. The guarded actions are simplified according to this state encoding. 5.2. Computing Passive Code

5.2. Computing Passive Code

In principle, the conditions of Section 4 can be applied to these new actions. However, by defining our dataflow analysis on the EFSM representation, an approximation of the required conditions can be determined much more easily. The basic idea of this section is to compute for each state s of the EFSM a set of required variables reqvars_s, that is, the set of variables that are required in this state to be able to compute the outputs. Thereby, we basically abstract the required condition req_x of each variable x to a Boolean condition that merely consists of a disjunction of states, that is, a conservative approximation of the previous approach. The most important advantage of this alternative approach is that the set representation of the required variables allows us to statically disable all guarded actions of redundant variables. Hence, no additional dynamic effort is introduced, so that the optimization always improves the performance. We define the required variables according to the reasons why a variable y may be required in a state s. The following list explains the auxiliary functions that are formally defined in Figure 8.


Fig. 8. Auxiliary definitions for semi-symbolic analysis.

—UsedByNowVars(s) contains the variables x that are read by an action that immediately writes a variable y required in state s.
—UsedByNxtVars(s) contains the variables x that are read by a delayed assignment contained in state s to a variable y that is required in a successor state s′ of s.
—DefNowVars(s) is the set of variables written by a guarded action with an immediate assignment whose guard must be true in state s ('must be true' is estimated by the test procedure Check(φ)).
—DefNxtVars(s) is the set of variables written by a guarded action with a delayed assignment whose guard must be true in state s (again, estimated by the test procedure Check(φ)).
—Check(φ): In principle, the previous two properties would require deciding the disjunction of the guards. However, since deciding whether this disjunction is true or false involves a complete reachability analysis, Check(φ) conservatively approximates φ: it only returns true if φ is certainly true; if the heuristic cannot determine a value, it conservatively returns false. In the experiments shown in the next section, we use an SMT (satisfiability modulo theories) solver as a backend for this decision.
—UsedByAbsVars(s) is the set of memorized variables that must be stored in state s due to the reaction to absence, that is, the memorized variables y that are not defined in s by an enabled delayed assignment but that are required in a successor state s′ of s, in which y is also not immediately defined.

As the description suggests, these auxiliary definitions of Figure 8 correspond to the ones given in Figure 5. The same holds for the reqvars_s sets, which are likewise defined as the least fixpoint of the following equation system.

Definition 5.2 (Required Variables). The following system of mutually dependent fixpoint equations describes the desired reqvars_s sets for the EFSM states s0, . . . , sn.

    reqvars_{s0} =μ UsedByNowVars(s0) ∪ UsedByNxtVars(s0) ∪ UsedByAbsVars(s0) ∪ V_O
          ⋮
    reqvars_{sn} =μ UsedByNowVars(sn) ∪ UsedByNxtVars(sn) ∪ UsedByAbsVars(sn) ∪ V_O

Hence, each state always computes the output variables V_O. Additionally, it computes the variables that are needed due to direct data dependencies, which is reflected in the definitions UsedByNowVars and UsedByNxtVars. Finally, it also keeps the variables that have been previously computed and will be used in some future step (UsedByAbsVars). Figure 9 shows the results of the analysis for the example of Figure 7, where the Boolean variables s1 and s2 hold in states s1 and s2, respectively. The analysis finds that the addition and the multiplication of i1 and i2 are used in alternating steps, so that the addition can be removed from s2 and the multiplication can be removed from s1. It also reveals that the inputs a and b are not read in state s2 (according to reqvars_{s2}).
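The least fixpoint of Definition 5.2 can be computed by a simple round-robin iteration, since all sets grow monotonically. The following Python sketch uses hypothetical names: used_by_now_vars, used_by_nxt_vars, and used_by_abs_vars stand for the definitions of Figure 8 (they may consult the current approximation of the successor states), and successors(s) yields the EFSM successors of s. It also shows how the conservative Check(φ) test could be realized with Z3's Python bindings, and how the result statically disables guarded actions of variables that are not required:

```python
import z3

def check(phi):
    """Conservative validity test Check(phi): returns True only if phi certainly
    holds; on sat or unknown it conservatively returns False (cf. Figure 8)."""
    solver = z3.Solver()
    solver.add(z3.Not(phi))           # phi is valid iff its negation is unsatisfiable
    return solver.check() == z3.unsat

def required_variables(efsm, outputs,
                       used_by_now_vars, used_by_nxt_vars, used_by_abs_vars):
    """Round-robin iteration towards the least fixpoint of Definition 5.2."""
    reqvars = {s: set(outputs) for s in efsm.states}   # V_O is always required
    changed = True
    while changed:                    # terminates: all sets grow monotonically
        changed = False
        for s in efsm.states:
            new = (used_by_now_vars(s, reqvars)
                   | used_by_nxt_vars(s, reqvars)
                   | used_by_abs_vars(s, reqvars))
            if not new <= reqvars[s]:
                reqvars[s] |= new
                changed = True
    return reqvars

def disable_passive_code(efsm, reqvars, successors):
    """Statically drop guarded actions whose written variable is not required:
    immediate writes are checked against the current state, delayed writes
    against the successor states (simplified)."""
    return {s: {ga for ga in efsm.dataflow[s]
                if (ga.target in reqvars[s] if not ga.delayed
                    else any(ga.target in reqvars[t] for t in successors(s)))}
            for s in efsm.states}
```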


Fig. 9. Applying the semi-symbolic analysis to the example of Figure 7.

This condition can be used in the dataflow analysis of components that connect to the given module. Similarly, redundancy introduced at the output o may result in more redundancy inside the component.

THEOREM 5.3 (CORRECTNESS OF REQUIREDNESS CONDITIONS). A relaxed transition system with respect to the computed required variables preserves the behavior of the original system.

PROOF. Assume (S, s0, T, D) is an EFSM. The auxiliary definitions given in Figure 5 are conservative approximations of the definitions given in Figure 8. The following implications can be easily verified for all v ∈ V and all s ∈ S:

    s |= UsedByNow(v) → v ∈ UsedByNowVars(s),
    v ∈ DefNowVars(s) → s |= DefByNow(v),
    s |= UsedByNxt(v) → v ∈ UsedByNxtVars(s),
    v ∈ DefNxtVars(s) → s |= DefByNxt(v),
    s |= UsedByAbs(v) → v ∈ UsedByAbsVars(s),

where s |= ϕ means that ϕ evaluates to true at state s. Thus, UsedByNowVars and UsedByNxtVars are overapproximated, whereas DefNowVars and DefNxtVars are underapproximated, which leads to an overapproximation of UsedByAbsVars. The correctness of Definition 4.3 and the use of overapproximations imply the correctness of Definition 5.2.

Our method shares some similarities with classical dataflow analysis. Indeed, the computation of passive code resembles the computation of the liveness of variables, one of whose purposes is dead code elimination, which also aims at reducing the effect of useless code (here, we take the interpretation of dead code in the sense of classical dataflow analysis, that is, reachable but never useful code). However, in dead code elimination, the dead code is never useful (and can therefore be safely removed), while in our case, the passive code can sometimes be useful and sometimes be useless: whether it is useful depends on a condition, namely the required predicate. This dynamic view of uselessness distinguishes the computation of passive code from dead code elimination; we therefore compute required predicates instead of a static liveness property.

A major reason for applying the optimization at the level of the synchronous language (and not on the generated sequential code) is the need to take the semantics of synchronous languages into account, in particular, the following two aspects.

Table I. Experimental Results—Size and WCRT Reduction

                            before optimization     after optimization
    Example                    AIF       WCRT          AIF       WCRT     reduction
    Cruise Control           58 kB        184        48 kB        138       −25 %
    Car Velocity               142                     142
    Logical Sensors 1        2,469      2,628        2,241      2,628
    Logical Sensors 2        8,022      7,483        7,839      7,482        −5 %
    Logical Sensors 3       24,758      7,566       23,953      7,478        −2 %
    Island Traffic Control   4,444        360        3,095        330        −9 %
    MinePumpController          49        214           48        211        −2 %
    BuildPixel               2,852     12,914        2,839     12,903        −1 %
    IDCT                     8,038     45,281        7,614     41,202       −10 %
    Dequantization             721                     721
    DeZigZag                   434                     434
    DLXStalling              5,610          —        1,713          —      (−70 %)
    DLXForwarding            1,636          —          396      2,142      (−76 %)
    DLXSim                 152,523          —       24,487          —      (−84 %)

—Notion of Macro Steps. The required conditions are best explained in terms of the macro-step semantics. This would be much more difficult in the synthesized sequential code. In our case, the code is synthesized from the EFSM, where each state corresponds to the actions of a macro step. In the generated code, however, the notion of a macro step is rather blurred or even lost: there is a single loop in the generated sequential code, but this loop cannot be treated as a single macro step, because it encodes the states and the transitions of the EFSM. Thus, it is not straightforward to recover the information needed for performing a dataflow analysis based on the macro-step semantics.
—Distinction of Input/Output Variables. At the beginning of each macro step, all values of the input variables are read (by the semantics of synchronous languages). There is no code for this in the synchronous program. In the generated sequential code, this is expected to happen in parallel, for example, by sensors via memory-mapped I/O (the sensors directly write to the memory that is read by the sequential program). The definition of the requiredness predicate takes care of these effects, but a compiler for the sequential code has no chance to do so, since the notion of inputs and outputs is lost there.

6. EXPERIMENTAL RESULTS

We evaluated our dataflow analysis with the help of several practical examples. As already outlined at the beginning of Section 5, we use the semi-symbolic analysis in our compiler to optimize the runtime efficiency of the generated software. Our F# implementation is based on our Averest framework4, and we use the SMT solver Z3 [Mendonça de Moura and Bjørner 2008], which is directly connected via its .NET managed API. In total, we evaluated more than a dozen examples from the Averest benchmark suite (shown in Tables I and II). The example CruiseControl is a small high-level specification of a cruise control system. The system LogicalSensors is a larger dataflow example, which describes the sensor component of a vehicle stability system. It consists of many small components, like CarVelocity, which are each responsible for the computation of a dedicated sensor value. As the complete sensor fusion LogicalSensors consists of too many EFSM states for a global optimization, our experiments use three partitions

4 http://www.averest.org.


Table II. Experimental Results—Runtime Simulation

                            before optimization         after optimization
    Example                  runtime       #instr.       runtime       #instr.   reduction
    Cruise Control          0.00126 s       454,782     0.00117 s       436,793      −4 %
    Logical Sensors 1       0.014497      4,795,216     0.011551      3,151,055     −35 %
    Logical Sensors 2       0.06649      24,858,964     0.0221       10,710,031     −57 %
    Logical Sensors 3       0.04953      16,022,001     0.04744      15,482,417      −4 %
    Island Traffic Control  0.0043        1,690,885     0.00272       1,013,897     −41 %
    MinePumpController      0.0037        1,450,783     0.0036        1,425,721      −2 %
    BuildPixel              0.04521      17,204,274     0.04511      17,203,905      −1 %
    IDCT                    5.5486      335,441,060     5.236       317,539,090      −6 %

(LogicalSensors1, LogicalSensors2, and LogicalSensors3). Besides controllers, we also applied our method to data-processing software, in particular, a JPEG decoder. Two of its four components (BuildPixel and IDCT) can be improved by the optimization, while the other two components (Dequantization and DeZigZag) simply gave no improvement. The final three examples show the full potential of our optimization when applied in the context of component-based design. DLXSim is a simulator of a MIPS processor, and DLXForwarding and DLXStalling are parts of it describing its pipeline (variants with stalling and forwarding). All parts of the system have a structural description, that is, they are mainly composed of predefined components. As not all of them are used in all steps, our analysis can identify and eliminate a large amount of passive code in the system.

All examples are written in either Quartz or Opal, and our compiler translated them to SGAs in XML-based AIF (Averest Intermediate Format) files. Our optimization procedure reads such an AIF file and writes the optimized result to another AIF file. For optimization, it performs the semi-symbolic dataflow analysis and uses its result to deactivate passive code. For all benchmarks, the optimization procedure returned a result within 5 to 10 minutes on a standard desktop computer, which is acceptable for an aggressive code optimization. For the original version (still containing passive code) and the optimized version, we generate C code, which is finally compiled by gcc. For both variants, we chose the target arm-elf (ARM processor, ELF binary format), a common platform for embedded systems. By enabling all optimizations in the C compiler (compiler switch -O3), we check whether our procedure finds more optimizations than gcc does on the generated C code.

Our first metric is the size of the intermediate code (AIF) and thereby the size of the two EFSMs: as shown in Table I, our compiler only eliminated dead code in the first EFSM (column before optimization), while it additionally analyzed and suppressed passive code as just described in the second one (column after optimization). The rightmost column shows the percentage reduction. In these experiments, we first removed unreachable code, then we removed dead code, and then we added the required predicates to the remaining passive code. This process is iterated until a fixpoint is reached. Note that the elimination of unreachable code can generate further dead code, which can again be safely removed. This affects the size of the program after optimization, sometimes with significant reductions. Moreover, we employed an SMT solver for checking the satisfiability of the guards of guarded actions, which gives us a more precise optimization result compared with current classical compiler optimizations. Finally, another reason for the reduced code size is the logical optimization of the guards. Note that the optimization eliminates all unreachable and dead code. For a pure passive-code analysis, which only adds required predicates to local variables, we observe an average code-size increase of 20%. However, in many cases, we still got a reduced code size after the elimination.


Most of our examples have been improved by applying our optimization procedures. Size reductions vary from a few kB to a couple of MB. Larger examples show the full potential: our experiments included examples with size reductions of up to 84%.

Our second metric is the estimated worst-case reaction time (WCRT), which we obtained from the WCET analyzer Bound-T.5 Bound-T determines, for architecture-specific executables, a safe upper bound on the number of CPU cycles the program requires on a specified processor. There are several versions of Bound-T supporting various target platforms; we used the version for ARM7 processors (ARM7TDMI core). For each benchmark, we used Bound-T to determine the WCET of an arbitrary single step of the synchronous program, that is, the worst-case reaction time (WCRT).

Our third metric is the simulated runtime of the executables. We employed the simulator XEEMU [Herczeg et al. 2007], which simulates an ARM-based Intel 80200 XScale processor and its SDRAM. For each benchmark, we created 40 typical test cases and simulated the execution of the first 10,000 reaction steps (many benchmarks implement a reactive system, which never terminates).

Table I also shows the results of the WCET analysis for our examples. The column headers should be self-explanatory. Empty cells in the table indicate that we have not carried out the WCET analysis, since there is no difference between the original AIF file and the optimized file. A dash (—) means that the program could not be analyzed by Bound-T for technical reasons (e.g., called functions make use of unanalyzable library calls, or the target executable was too large). Typically, there is a correlation between code size and runtime performance: since the general structure of the EFSM is preserved, a reduction of its size typically leads to a reduction of execution time (in particular for the average case considered next), because computations of the program are eliminated. The results of the WCET analysis also show that the examples have better performance with our optimizations: for the successfully tested examples, the WCRT is improved by 5%–10% for most of them.

Table II shows the simulation results for all the examples of Table I that our procedure has optimized. The first column gives the name of the example, the second group of columns gives the data before passive-code elimination, and the last one gives the corresponding figures for the optimized version. For both variants, we determine the average runtime and the number of executed instructions. We obtain results similar to the ones of the WCRT analysis: the average runtime of the examples is reduced by our procedure. For each example, we notice a correlation between the improvements of the WCRT and the simulation experiment, where the speedup of the simulation is usually better than the improvement of the WCRT bound. Obviously, this is because the WCRT improvement refers only to the worst reaction of the example.

To summarize: first, the improvements obviously depend on the given program; second, the larger the program, the more optimizations can typically be made, which is often due to large parallel loops with interdependencies; third, the more predefined components a system uses, the more passive code it will typically contain.

5 http://bound-t.com and http://www.tidorum.fi/bound-t.

7. SUMMARY

In this article, we presented a static dataflow analysis for synchronous programs that works on synchronous guarded actions as an intermediate code representation. Thus, it is independent of the target platform as well as of a particular synchronous language (since all synchronous languages can be compiled to guarded actions). Since not all values of the internal variables of a program are required in each reaction step to compute the outputs, our optimization procedure computes for every internal variable a condition that determines whether its value is required for the current or a future computation of the outputs. We have presented fixpoint equation systems as the formal foundation of our code optimization. Moreover, we have considered a semi-symbolic variant of the optimization procedure, which determines the set of required variables for every reachable control-flow state of the synchronous system. The identification of passive code, as obtained from the required variables of a state, can be used to reduce the energy consumption of generated hardware circuits or to speed up the reaction time of generated software. Typical applications that benefit from this optimization are systems whose behavior is based on certain modes that use different functional units to compute the system's output. Such situations often result from the reuse of already existing components.

ACKNOWLEDGMENTS

We would like to thank Niklas Holsti from Tidorum Ltd. for providing useful comments and discussions about the experiments with Bound-T.

REFERENCES

F. E. Allen. 1970. Control flow analysis. ACM SIGPLAN Not. 5, 7, 1–19.
Y. Bai. 2010. Dependency analysis of synchronous programming languages. M.S. thesis. Department of Computer Science, University of Kaiserslautern, Germany.
Y. Bai, J. Brandt, and K. Schneider. 2011. Data-flow analysis of extended finite state machines. In Proceedings of the Conference on Application of Concurrency to System Design (ACSD), B. Caillaud and J. Carmona, Eds., IEEE Computer Society.
A. Benveniste, P. Caspi, S. Edwards, N. Halbwachs, P. Le Guernic, and R. de Simone. 2003. The synchronous languages twelve years later. Proc. IEEE 91, 1, 64–83.
G. Berry. 1999. The constructive semantics of pure Esterel. http://www-sop.inria.fr/esterel.org/.
G. Berry and G. Gonthier. 1992. The Esterel synchronous programming language: Design, semantics, implementation. Sci. Comput. Program. 19, 2, 87–152.
R. Bloem, H. N. Gabow, and F. Somenzi. 2000. An algorithm for strongly connected component analysis in n log n symbolic steps. In Formal Methods in Computer-Aided Design (FMCAD), W. A. Hunt and S. D. Johnson, Eds., Lecture Notes in Computer Science, vol. 1954. Springer, 37–54.
J. Brandt and K. Schneider. 2009a. Separate compilation for synchronous programs. In Proceedings of the 12th International Workshop on Software Compilers for Embedded Systems (SCOPES), H. Falk, Ed., ACM, 1–10.
J. Brandt and K. Schneider. 2009b. Static data-flow analysis of synchronous programs. In Proceedings of the 7th IEEE/ACM International Conference on Formal Methods and Models for Codesign (MEMOCODE), R. Bloem and P. Schaumont, Eds., IEEE Computer Society, 161–170.
J. Brandt and K. Schneider. 2011. Separate translation of synchronous programs to guarded actions. Internal report 382/11, Department of Computer Science, University of Kaiserslautern, Kaiserslautern, Germany.
K. M. Chandy and J. Misra. 1989. Parallel Program Design. Addison-Wesley, Austin, Texas.
R. Cleaveland, M. Klein, and B. Steffen. 1993. Faster model checking for the modal μ-calculus. In Computer Aided Verification (CAV), G. von Bochmann and D. Probst, Eds., Lecture Notes in Computer Science, vol. 663. Springer, 410–422.
E. Closse, M. Poize, J. Pulou, J. Sifakis, P. Venter, D. Weil, and S. Yovine. 2001. TAXYS: A tool for the development and verification of real-time embedded systems. In Computer Aided Verification (CAV), G. Berry, H. Comon, and A. Finkel, Eds., Lecture Notes in Computer Science, vol. 2102. Springer, 391–395.
E. Closse, M. Poize, J. Pulou, P. Venier, and D. Weil. 2002. SAXO-RT: Interpreting Esterel semantics on a sequential execution structure. Electron. Notes Theor. Comput. Sci. 65, 5, 80–94.
M. Dam. 1994. CTL* and ECTL* as fragments of the modal μ-calculus. Theor. Comput. Sci. 126, 1, 77–96.
J. W. de Bakker. 1976. Least fixed points revisited. Theor. Comput. Sci. 2, 2, 155–181.
E. W. Dijkstra. 1975. Guarded commands, nondeterminacy and formal derivation of programs. Comm. ACM 18, 8, 453–457.


D. L. Dill. 1996. The Murphi verification system. In Computer Aided Verification (CAV), R. Alur and T. Henzinger, Eds., Lecture Notes in Computer Science, vol. 1102. Springer, 390–393.
S. Edwards. 2002. An Esterel compiler for large control-dominated systems. IEEE Trans. Comput.-Aid. Des. Integr. Circuits Syst. 21, 2, 169–183.
S. A. Edwards, V. Kapadia, and M. Halas. 2004. Compiling Esterel into static discrete-event code. In Proceedings of the 3rd International Workshop on Synchronous Languages, Applications, and Programming (SLAP). Electronic Notes in Theoretical Computer Science, vol. 133, 4, 117–131.
E. A. Emerson, C. S. Jutla, and A. P. Sistla. 1993. On model-checking for fragments of μ-calculus. In Computer Aided Verification (CAV), C. Courcoubetis, Ed., Lecture Notes in Computer Science, vol. 697. Springer, 385–396.
E. A. Emerson, C. S. Jutla, and A. P. Sistla. 2001. On model checking for the μ-calculus and its fragments. Theor. Comput. Sci. 258, 1–2, 491–522.
N. Halbwachs. 1993. Synchronous Programming of Reactive Systems. Kluwer.
N. Halbwachs, P. Raymond, and C. Ratel. 1991. Generating efficient code from data-flow programs. In Programming Language Implementation and Logic Programming (PLILP), J. Maluszynski and M. Wirsing, Eds., Lecture Notes in Computer Science, vol. 528. Springer, 207–218.
M. S. Hecht and J. D. Ullman. 1973. Analysis of a simple algorithm for global data flow problems. In Proceedings of the Symposium on Principles of Programming Languages (POPL). ACM, 207–217.
M. S. Hecht and J. D. Ullman. 1974. Characterizations of reducible flow graphs. J. ACM 21, 3, 367–375.
Z. Herczeg, A. Kiss, D. Schmidt, N. Wehn, and T. Gyimóthy. 2007. XEEMU: An improved XScale power simulator. In Power and Timing Modelling, Optimization and Simulation (PATMOS), N. Azemard and L. Svensson, Eds., Lecture Notes in Computer Science, vol. 4644. Springer, 300–309.
B. Jiang. 1993. I/O- and CPU-optimal recognition of strongly connected components. Inf. Process. Lett. 45, 3, 111–115.
L. Ju, B. K. Huynh, S. Chakraborty, and A. Roychoudhury. 2009. Context-sensitive timing analysis of Esterel programs. In Proceedings of the Design Automation Conference (DAC). ACM, 870–873.
L. Ju, B. K. Huynh, A. Roychoudhury, and S. Chakraborty. 2010. Timing analysis of Esterel programs on general purpose multiprocessors. In Proceedings of the Design Automation Conference (DAC), S. Sapatnekar, Ed., ACM, 48–51.
H. Järvinen and R. Kurki-Suonio. 1990. The DisCo language and temporal logic of actions. Tech. rep. 11. Software Systems Laboratory, Tampere University of Technology.
D. Kozen. 1983. Results on the propositional μ-calculus. Theor. Comput. Sci. 27, 3, 333–354.
G. Logothetis and K. Schneider. 2003. Exact high level WCET analysis of synchronous programs by symbolic state space exploration. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference. IEEE Computer Society, 10196–10203.
L. Mendonça de Moura and N. Bjørner. 2008. Z3: An efficient SMT solver. In Proceedings of the Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), C. Ramakrishnan and J. Rehof, Eds., Lecture Notes in Computer Science, vol. 4963. Springer, 337–340.
D. Potop-Butucaru and R. de Simone. 2003. Optimizations for faster execution of Esterel programs. In Proceedings of the IEEE/ACM International Conference on Formal Methods and Models for Codesign (MEMOCODE). IEEE Computer Society, 227–236.
D. Potop-Butucaru, S. A. Edwards, and G. Berry. 2007. Compiling Esterel. Springer.
B. K. Rosen. 1977. High-level data flow analysis. Comm. ACM 20, 10, 712–724.
P. Raymond. 1991. Compilation efficace d'un langage déclaratif synchrone: le générateur de code LUSTRE-V3. Ph.D. thesis, Institut National Polytechnique de Grenoble, Grenoble, France.
D. A. Schmidt. 1998. Data flow analysis is model checking of abstract interpretations. In Proceedings of the Symposium on Principles of Programming Languages (POPL). ACM, 38–48.
K. Schneider. 2000. A verified hardware synthesis for Esterel. In Proceedings of the IFIP International Workshop on Distributed and Parallel Embedded Systems (DIPES), F. Rammig, Ed., Kluwer, 205–214.
K. Schneider. 2001a. Embedding imperative synchronous languages in interactive theorem provers. In Proceedings of the 2nd International Conference on Application of Concurrency to System Design (ACSD). IEEE Computer Society, 143–154.
K. Schneider. 2001b. Improving automata generation for linear temporal logic by considering the automata hierarchy. In Logic for Programming, Artificial Intelligence, and Reasoning (LPAR), R. Nieuwenhuis and A. Voronkov, Eds., Lecture Notes in Artificial Intelligence, vol. 2250. Springer, 39–54.
K. Schneider. 2002. Proving the equivalence of microstep and macrostep semantics. In Theorem Proving in Higher Order Logics (TPHOLs), V. Carreño, C. Muñoz, and S. Tahar, Eds., Lecture Notes in Computer Science, vol. 2410. Springer, 314–331.


K. Schneider. 2003. Verification of Reactive Systems - Formal Methods and Algorithms. Texts in Theoretical Computer Science. Springer.
K. Schneider. 2009. The synchronous programming language Quartz. Internal Report 375. Department of Computer Science, University of Kaiserslautern, Kaiserslautern, Germany.
K. Schneider and J. Brandt. 2008. Performing causality analysis by bounded model checking. In Proceedings of the 8th International Conference on Application of Concurrency to System Design (ACSD). IEEE Computer Society, 78–87.
K. Schneider, J. Brandt, and T. Schuele. 2004. Causality analysis of synchronous programs with delayed actions. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES). ACM, 179–189.
K. Schneider, J. Brandt, and T. Schuele. 2006. A verified compiler for synchronous programs with local declarations. Electron. Notes Theor. Comput. Sci. 153, 4, 71–97.
T. Schuele and K. Schneider. 2004. Abstraction of assembler programs for symbolic worst case execution time analysis. In Proceedings of the Design Automation Conference (DAC), S. Malik, L. Fix, and A. Kahng, Eds., ACM, 107–112.
M. Sharir. 1981. A strong-connectivity algorithm and its application in data flow analysis. Comput. Math. Appl. 7, 1, 67–72.
B. Steffen. 1991. Data flow analysis as model checking. In Proceedings of the Conference on Theoretical Aspects of Computer Software (TACS), T. Ito and A. R. Meyer, Eds., Lecture Notes in Computer Science, vol. 526. Springer, 346–364.
SYNCHRON. 1998. The common format of synchronous languages - the declarative code DC. Tech. rep. C2A, SYNCHRON project.
O. Tardieu and R. de Simone. 2003. Instantaneous termination in pure Esterel. In Static Analysis Symposium (SAS), R. Cousot, Ed., Lecture Notes in Computer Science, vol. 2694. Springer, 91–108.
R. Tarjan. 1972. Depth first search and linear graph algorithms. SIAM J. Comput. 1, 2, 146–160.
A. Tarski. 1955. A lattice-theoretical fixpoint theorem and its applications. Pacific J. Math. 5, 2, 285–309.
A. Xie and P. A. Beerel. 2000. Implicit enumeration of strongly connected components and an application to formal verification. IEEE Trans. Comput.-Aid. Des. Integr. Circuits Syst. 19, 10, 1225–1230.
J. Zeng, C. Soviani, and S. A. Edwards. 2004. Generating fast code from concurrent program dependence graphs. In Proceedings of the ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), D. Whalley and R. Cytron, Eds., ACM, 175–181.

Received March 2012; revised December 2012, April 2013; accepted July 2013
