SLAP’05 Preliminary Version
A Survey of Automatic Distribution Methods for Synchronous Programs

Alain Girault 1
Inria Rhône-Alpes, Pop Art project, 655 avenue de l'Europe, 38334 Saint-Ismier cedex, France
Abstract. Research on the automatic distribution of synchronous programs started in 1987, almost twenty years ago. Basically, from a single synchronous program, possibly along with distribution specifications (but not necessarily), it involves automatically producing several synchronous programs that communicate with each other so as to achieve the same behavior as the initial centralized program. Since 1987, many approaches have been proposed, some general, others tailored specifically to a given synchronous language. Methods that were basic at the beginning are now quite sophisticated and offer many features. Finally, the successes obtained with these methods have led to the definition of a new model of computation, known as Globally Asynchronous Locally Synchronous (GALS). The goal of this article is to present a survey of all existing distribution methods.
Key words: Synchronous programs, automatic distribution, parallelization, desynchronization, GALS.
1 Introduction
The synchronous paradigm was proposed in the nineteen-eighties to ease the programming of embedded reactive systems [7,5]. These systems are most of the time distributed, first to offer greater computing power, and second to bring the computing power closer to the physical sensors and actuators of the system. For this reason, numerous automatic distribution methods have been proposed. "Automatic" means that, from a given centralized synchronous program P and a given distributed-memory architecture A, the mapping of P onto A should be obtained automatically, with the necessary communication instructions inserted, in such a way that the resulting distributed program has a behavior consistent with that of P.
1 Email: [email protected]
This is a preliminary version. The final version will be published in Electronic Notes in Theoretical Computer Science, URL: www.elsevier.nl/locate/entcs
In this survey, I start by summarizing the basic compiling methods for synchronous programs in Section 2. This is necessary in order to understand the distribution methods, since most of them were conceived as post-processors of synchronous compilers. Section 3 presents the motivations for distribution as well as the general principles. Section 4 is the core of my article: I gather all the existing distribution methods (to the best of my knowledge). Finally, in Section 5, I give some hints on how to choose the most suitable distribution method when confronted with a particular problem.
2 Compiling methods and intermediate formats
A synchronous compiler must perform the causality analysis of the source program and produce sequential code. More than the compiling method, it is the generated code that I want to emphasize in the present section. There exist several types of generated code, with varying kinds of control structures. In the sequel, I refer to the output formats produced by the synchronous compilers by their names: OC, SC, DC, CP...

The “finite automaton” OC format
The OC format (“Object Code”) is an intermediate format common to the Lustre and Esterel compilers. An OC program is a finite state automaton coupled with a bounded memory for handling the variables. In each state of the automaton, there is sequential code represented by a directed acyclic graph of actions (a DAG). The problem of this explicit representation is the combinatorial explosion due to the synchronous parallelism: for a medium-sized program, the automaton can have millions of states. This is why two other formats have been proposed, on the one hand DC (“Data-flow Code”) for the languages Lustre and Signal-Polychrony, and on the other hand SC (“Sequential Code”) for the language Esterel.

Clocks in data-flow languages
Before describing DC, I must mention clocks [11]. Lustre and Signal-Polychrony are both synchronous data-flow languages, where each variable manipulated by the program is a stream, an infinite sequence of typed values. In this context, a clock is a form of temporal type [11,18]. The clock of a stream is the sequence of the logical instants where this stream bears a value. In contrast with many circuit description formalisms, in Lustre and Signal any Boolean stream can be a clock. Lustre and Signal offer operators to down-sample and up-sample streams: down-sampling creates a slower clock, while up-sampling projects a stream onto a faster clock. In Lustre, a predefined clock always exists: the base clock of the program, that is, the sequence of its activation instants (also called logical instants). Besides, all the clocks of a Lustre program belong to the same clock tree, whose root is precisely the base clock of the program.
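To make the notion of clock concrete, here is a minimal C sketch (the streams x, y, and c and their definitions are invented for the example, not taken from any synchronous compiler): it simulates a stream x on the base clock and a stream y obtained by down-sampling x with a Boolean stream c, so that y bears a value only at the instants where c is true.

```c
#include <stdio.h>
#include <stdbool.h>

int main(void) {
    int x = 0;
    for (int tick = 0; tick < 10; tick++) {   /* one iteration = one logical instant */
        x = tick * tick;                      /* x is on the base clock: always present */
        bool c = (tick % 3 == 0);             /* any Boolean stream can be a clock */
        if (c) {
            int y = x;                        /* y = x down-sampled by c: present only when c is true */
            printf("tick %d: x = %3d, y = %3d\n", tick, x, y);
        } else {
            printf("tick %d: x = %3d, y absent\n", tick, x);
        }
    }
    return 0;
}
```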
In Signal, by contrast, the clocks do not always form a single tree: the structure can be a forest, with several roots that are non-comparable clocks.

The “declarative network” DC format
A DC program is a parallel network of operators, which behaves like a reactive machine transforming, at each reaction, a vector of inputs into a vector of outputs. The sequence of reactions of the program defines the sequence of its logical instants. All the objects manipulated by the program are streams, which bear a value at each logical instant of their respective clock. The program computes the output streams from the input streams (as well as local streams) with conditioned equations. Each activation condition is a Boolean stream: at each instant where it bears the value true, the equation computes a new value; otherwise its variable keeps its previous value. Besides, programs are structured as nodes that encapsulate equations; they can invoke external functions as well as imperative procedures, possibly with side effects. Finally, a table indicates the dependencies between the equations. Because of its control structure, the DC format is ideally suited to symbolic verification, silicon compiling, and distributed code generation.

The case of Signal-Polychrony is particular. The usual compiling method involves the production of a hierarchical data-flow graph with conditional dependence equations, called SDFG (“Synchronized Data-Flow Graph”). Its nodes are the input and output signals and the variables, while its edges are labeled with clocks. Thus, each node is located within some clock hierarchy. DC code can then be produced from this SDFG. However, from a certain point of view, the Signal language is less strictly synchronous than Lustre: Signal is considered relational while Lustre is functional. For instance, in Signal it is possible to write programs where the outputs are “more frequent” than the inputs, while such an up-sampling is forbidden in Lustre. In terms of clocks, the set of clocks of a Lustre program is a tree whose root is the program's base clock, while for a Signal program, in the general case, it is a forest, with several roots unrelated to each other. Therefore, not all Signal programs can be compiled into DC (where, by construction, there can exist only one clock, the rest being activation conditions; this can be understood as a flat clock tree, but certainly not as a forest). More precisely, to compile into DC a Signal program having a clock forest with more than one root, one must adjoin to it a monitor in charge of giving the presence and absence information of each such root clock with the help of Booleans; in other words, one must adjoin to it a master clock acting as the father of all root clocks, therefore turning the forest into a tree; hence this involves abandoning the loosely synchronized character of Signal. This is why several other formats, more general than DC, have been proposed, among them DC+, a superset of DC. In the context of the present article, I will limit myself to Signal programs that can be compiled into DC.
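To fix ideas on the DC execution model, here is a minimal C sketch of one reaction of a DC-like network; the two equations, their activation condition, and all names are illustrative assumptions, not actual DC syntax. Each conditioned equation fires only when its activation condition is true, and its variable keeps its previous value otherwise.

```c
#include <stdbool.h>

/* Persistent memory of the network: each stream keeps its last value. */
static int n = 0, half = 0;

/* One reaction: the input is in, the outputs are *out_n and *out_half. */
void reaction(int in, bool cond, int *out_n, int *out_half) {
    n = n + in;               /* equation on the base clock: fires at every instant */
    if (cond) {               /* conditioned equation: fires only when cond is true */
        half = n / 2;
    }                         /* otherwise half keeps its previous value */
    *out_n = n;
    *out_half = half;
}
```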
The “sequential circuit” SC format
An SC program is a finite Boolean sequential circuit coupled with a table of external actions. The sequential circuit encodes the control part of the program, while the table of actions encodes the data part: the actions manipulate, in particular, input/output signals, internal variables, external procedures and functions, and so on. The circuit drives the table of actions, meaning that some of its wires invoke one action as soon as the wire switches from 0 to 1. Once the SC circuit has been produced by the compiler, it is possible to generate target code such as VHDL, C, Java... Actually, the representation of a synchronous program as a sequential circuit is dual to its representation as a finite state automaton: a circuit with n registers corresponds to an automaton with 2^n states. Generating a sequential circuit allows the production of small target code, in general linear in the size of the Esterel source program. Because of its control structure, the SC format is ideally suited to silicon compiling, as well as to a variety of analysis methods: simulation, formal verification, test case generation...

The “control points” CP format
A third compilation method has been proposed for Esterel programs. It involves, first, slicing the source program into small blocks of instructions, separated by control points, and second, statically scheduling these blocks in such a way that the synchronous semantics of Esterel is preserved. At each reaction of the program, only the active blocks are executed, which optimizes both the reaction time and the code size. I call this intermediate format CP (“Control Points”). This compiling method has several advantages: the generated code is both small (almost as much as SC) and fast (almost as much as OC), and moreover the compiling time is short (much shorter than with the other compiling methods for Esterel).
format      control structure
OC          sequential    explicit    static
DC          parallel      implicit    dynamic
SC          parallel      implicit    dynamic
CP          parallel      explicit    static
This table summarizes the characteristics of the control structure of the various intermediate formats. “Static” means that all the possible execution paths are known at compile-time, while “dynamic” means that they depend on the run-time values of the inputs.
3 Motivation
The need for automatic distribution
One crucial feature of embedded systems is not properly addressed by the synchronous programming model: the fact that these systems are often distributed; this means that they are deployed onto distributed-memory architectures, consisting of a set of processors (each with its own memory) and some communication device (e.g., a bus).
The goal of any automatic distribution method is precisely to find, automatically, a mapping of a given centralized synchronous program onto a given target distributed architecture, and to insert into each fragment of the distributed program the necessary communication instructions so as to guarantee that the resulting distributed program has a behavior consistent with the behavior of the centralized program. Most of the time, the target architecture is imposed by the application, hence the problem is not to exhibit the greatest possible parallelism in the source synchronous program, but rather to deploy it onto the given architecture, with some fixed level of parallelism. Also, it often happens that the application consists of some physical device that must be controlled through sensors and actuators; this is, after all, the case of most embedded reactive systems. In such a case, each sensor is physically located near one precise processor, so the input data from this sensor must be read by the corresponding processor; similarly, each actuator is physically located near one precise processor, so the output data written to this actuator must be computed by the corresponding processor. As a result, I will assume in the sequel that, along with the source synchronous program to be distributed, the user provides a set of distribution specifications, consisting of the list of processors of the target distributed architecture and, for each processor, the list of inputs/outputs it must handle.

The consequence of this situation is that the main goal of automatic distribution methods is not to find the “best” possible distribution of a centralized synchronous program. Indeed, the best distribution is the one that achieves the best load balance between the processors, whatever the optimization criterion, which is in general an NP-complete problem; hence most existing algorithms are based on heuristics. Anyway, trying to find the best load balance is, most of the time, contradictory with fulfilling the distribution specifications. In the sequel, instead of processors, I will rather refer to computing locations, because it is often convenient to execute several fragments of a given distributed program on the same processor, for instance for debugging purposes. If one wants to take into account long-duration tasks inside a synchronous program, it might indeed be the only solution [22]. Long-duration tasks constitute in themselves a motivation for distributing synchronous programs: these are tasks whose execution time is long compared to the other computations in the application, and whose maximal execution rate is known and bounded. Simply scheduling such a long-duration task at a slow rate would not work, since the whole program would be slowed down if compiled into sequential code. It would thus be impossible to meet the temporal constraints, unless such long-duration tasks could be desynchronized from the remaining computations, which is precisely what is achieved when the program is distributed.
The particularities of synchronous programs
The field of automatic distribution is vast and has been an active research area for a long time; see for instance [23] for a survey. In the particular case of synchronous programs, there are two characteristics to be aware of. First, the source program is parallel, not sequential as in a classical programming language (e.g., C, Pascal...). But this parallelism of expression is used by the programmer to conceive his/her application in terms of parallel modules cooperating to achieve the desired behavior. It is therefore not related to the parallelism of execution, which is due to the fact that the target architecture is distributed. Second, the causality analysis performed by the compiler prevents, a priori, modular compilation. The purpose of the causality analysis is to determine the sequential behavior of several parallel modules connected together by synchronization and communication relationships. Moreover, it makes it possible to determine exactly which variables are present or absent during the current reaction. A program is declared non-causal when it is not possible to determine this present/absent information for all its variables. It is precisely this information that allows a synchronous program to react to the absence of its inputs. These two characteristics imply that adapting a classical automatic distribution method to synchronous programs is not trivial at all. This explains why there has been so much research on the automatic distribution of synchronous programs during the past fifteen years.

General principles
Several automatic distribution methods exist; in all cases, the starting point is a centralized source program written in a synchronous language. These existing methods fall into three categories:
(i) Cut the source program into several fragments, compile them separately, and make them communicate harmoniously, that is, without deadlock and in such a way that the behavior of the initial centralized source program is equivalent to the joint behavior of the compiled fragments. This approach seems ideal because it is the easiest one. However, Raymond has shown that it is impossible to achieve in the general case [33].
(ii) Obtain by global compiling one sequential program for each fragment, again communicating harmoniously. This approach is appealing but very complex, because the code distributor must perform the causality analysis and guarantee that the obtained distributed program is still able to react to the absence of inputs. But if a synchronous program is cut into two (or more) fragments, how can one make sure that a given fragment reacts to some input of the other fragment? This is the solution that has been adopted for Signal programs [1,2,27,9,4,32] (see Section 4.7).
(iii) First compile the source program into one unique sequential and centralized program, and then distribute it. It can seem paradoxical to first sequentialize the source program (which is, let us recall, parallel), and
then to re-parallelize it. But, as I said above, the parallelism of expression is a priori not related to the parallelism of execution. This approach is the easiest one to implement, which explains why it has had so many successes. Indeed, most of the methods that I present in Section 4 fall into this category.

Globally asynchronous locally synchronous systems (GALS)
Once the distribution process is completed, we have a distributed program where each fragment is synchronous and where the fragments communicate with each other asynchronously. Such a system can therefore be called globally asynchronous locally synchronous (GALS [16]). This paradigm is used both for hardware and for software. In software, it is used to compose blocks specified as finite state machines and to make them communicate asynchronously [3]. This approach is particularly well suited to the design of embedded systems. In hardware, more and more circuits are designed as synchronous blocks communicating asynchronously rather than as big synchronous circuits [28]. This avoids having to propagate the clock everywhere in a big synchronous circuit, a task that is always difficult to perform and very costly energy-wise. This method thus contributes to diminishing the power consumption of the obtained GALS circuit [25].
4 Survey of existing automatic distribution methods
This section is the core of my survey. I present all the existing automatic distribution methods that I am aware of.

4.1 The Ocrep code distributor
The ocrep code distributor acts as a post-processor for both the Lustre and Esterel compilers. It takes as input one OC program and a file of distribution specifications, and produces as output several OC programs, one for each computing location of the distribution specifications [15]. An OC program is a finite deterministic automaton. This state graph can be cyclic, but in each state there is sequential acyclic code, represented by a rooted binary directed acyclic graph (DAG) of actions. A program manipulates three kinds of variables: input variables can be used only as right-values; local and output variables can be used as right-values and left-values; output variables are also written to the environment. Each DAG has one root, several unary and binary nodes, and one or more leaves:
• Unary nodes are sequential actions, which can be:
  · an indication that the input variables of the program have been read and that their values are available in the local memory of the program: go(...,in_i,...), where the in_i are the inputs,
  · an assignment to a local or output variable: var:=exp, where exp can contain external function calls,
  · an output writing: write(var),
  · an external procedure call: proc(...,var_i,...)(...,val_j,...), where the var_i and val_j are respectively the variable and value parameters.
• Binary nodes are deterministic branchings: if (var) then p else q endif, where p and q are sub-DAGs.
• Leaves, and only leaves, denote the next state number: goto n.
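As an illustration, here is a minimal C rendering of a hypothetical two-state OC automaton; the variables, the branching, the function f, and the state numbers are all invented for the example. Each state function is one DAG: the go action marks the reading of the inputs, the body is straight-line code with at most one branching, and the leaves select the next state. The main function anticipates the execution loop described just below.

```c
#include <stdio.h>

static int a, b;                   /* inputs (b is used as a Boolean) */
static int x;                      /* local/output variable */

static void go(void)     { if (scanf("%d %d", &a, &b) != 2) { a = b = 0; } }
static int  f(int v)     { return 2 * v; }           /* stands for an external function */
static void write_x(void){ printf("x = %d\n", x); }  /* output writing */

/* State 0: go, one branching, assignments, output, next state at the leaves. */
static int state_0(void) {
    go();                          /* go(a,b): the inputs are now available */
    if (b) {                       /* binary node: deterministic branching */
        x = f(a);                  /* unary node: assignment */
        write_x();                 /* unary node: output writing */
        return 1;                  /* leaf: goto 1 */
    }
    x = x + a;
    write_x();
    return 0;                      /* leaf: goto 0 */
}

/* State 1: no branching, single leaf back to state 0. */
static int state_1(void) {
    go();
    x = x + 1;
    write_x();
    return 0;                      /* leaf: goto 0 */
}

int main(void) {
    int state = 0;
    for (int cycle = 0; cycle < 5; cycle++)      /* the execution loop */
        state = (state == 0) ? state_0() : state_1();
    return 0;
}
```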
The interaction of an OC program with its environment is made explicit by the go actions. In a program with a list of inputs i_1,...,i_n, the first node of each DAG is the action go(i_1,...,i_n). This automaton format is quite general, since programs written in a classical imperative programming language can be compiled into it. In fact, any OC program can be translated into a flow graph of basic blocks, and vice-versa. Finally, concerning the execution, an OC program is embedded inside an execution loop: at each cycle, the inputs are read from the environment, one transition of the automaton is executed (i.e., the code of the current state's DAG is executed), and the outputs are written to the environment. For a reactive system, checking the temporal constraints amounts to verifying that the automaton can be run inside an execution loop whose period is less than the maximal time allowed by the temporal constraint.

The distribution algorithm of ocrep involves the following successive steps (a toy sketch of the result is given after this list):
(i) assign a unique computing location to each sequential action, according to the distribution specifications provided by the user; these specifications are a partition of the set of inputs and outputs of the program into n subsets, one for each computing location of the final parallel program;
(ii) replicate the program onto each location;
(iii) on each location, suppress the sequential actions not belonging to the considered location;
(iv) in each state of the automaton, insert sending actions in order to solve the data dependencies between two distinct locations; then insert receiving actions in order to match the sending actions;
(v) if required by the user, in each state, add the needed dummy communications in order to resynchronize the distributed program.

As a communication mechanism, FIFO queues are used. Concretely, between any pair of computing locations, two FIFOs are created. The communication primitives used are send, which is always non-blocking, and receive, which blocks when the FIFO is empty. These primitives perform both the data transfer and the synchronization needed between locations. The only requirement on the network is that it must preserve the ordering and the integrity of the messages.
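Here is the promised toy sketch of steps (ii)-(iv) in C, for one state distributed onto two locations, L1 (owning input a and output x) and L2 (owning output y). The variables, the dependency, and the FIFO layer are illustrative assumptions, not ocrep's actual output; the two locations are simulated in a single process for the sake of a self-contained example, whereas in practice they would be separate processes connected by sockets.

```c
#include <stdio.h>

/* A small FIFO standing in for the real channel (a socket in practice). */
static int fifo[16], head = 0, tail = 0;
static void send_int(int v) { fifo[tail++ % 16] = v; }    /* non-blocking */
static int  recv_int(void)  { return fifo[head++ % 16]; } /* would block if empty */

/* Centralized state body: go(a); x := a + 1; y := x * 2; write(x); write(y); */

static int a, x;          /* owned by location L1 */
static int x_copy, y;     /* L2 keeps a local copy of the distant x */

/* Location L1: owns input a and output x; y's code was suppressed (step iii). */
static void state_0_L1(int input_a) {
    a = input_a;                  /* go(a): L1 reads its own input */
    x = a + 1;
    send_int(x);                  /* step iv: x is needed by L2 */
    printf("L1: x = %d\n", x);    /* write(x) */
}

/* Location L2: owns output y; x is received instead of computed. */
static void state_0_L2(void) {
    x_copy = recv_int();          /* blocking receive matching L1's send */
    y = x_copy * 2;
    printf("L2: y = %d\n", y);    /* write(y) */
}

int main(void) {
    for (int cycle = 0; cycle < 3; cycle++) {  /* both execution loops, interleaved */
        state_0_L1(cycle);
        state_0_L2();
    }
    return 0;
}
```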
Provided that the send actions are inserted on one location in the same order as the corresponding receive actions on the other location (and this is exactly what the algorithm above does), this ensures that values are not mixed up. Concerning the implementation, Unix/Internet sockets work perfectly well [14]. To make the obtained distributed programs less sensitive to communication latencies, the sends are inserted as early as possible in the OC program, while the receives are inserted as late as possible. However, because of the physical meaning of the state (materialized by the go action present at the beginning of each state), neither of the two insertion algorithms crosses the state barrier; that is, the sends and receives are inserted state by state. Also, the correctness of this distribution algorithm has been formally proved in [10], meaning that the obtained distributed program is functionally equivalent to the initial centralized one.

One drawback of the communication mechanism used by this method is that, if one processor fails, then the process it was running will inevitably fail to send one value through one of its outgoing FIFOs. The process expecting this value will then be stuck waiting for it (remember that the receive blocks when the queue is empty). So this other process will, in turn, also fail to send one value, causing a third process to be stuck, and so on. Eventually, all the processes will be stuck, causing the failure of the entire system.

Finally, recent developments allow the user to write a synchronous program with one or several long-duration tasks, and to distribute it into several processes running at different rates, in such a way that each process having to execute one of these long-duration tasks is given enough time to complete while still meeting its temporal constraint (something impossible if the program is centralized) [22]. This method uses an original on-the-fly bisimulation to eliminate useless branchings [13], whose effect is to desynchronize the rates of the fragments of the distributed program. It works both with Lustre and with Esterel programs. In the Lustre case, it allows the distribution specifications to mention the clocks of the program (for instance, saying that all the computations performed on a given clock must be located on a given computing location), even though the OC program itself has no clocks (in OC, every computation is on the base clock). In the Esterel case, it allows the creation of rate-desynchronized distributed programs, even though the notion of clock/rate does not even exist in the language. To the best of my knowledge, it is the only method achieving such a result. It should be feasible to adapt it to the Signal compiler, but this remains to be done.
4.2 Distribution of Esterel programs within SAXO-RT
Saxo-RT is the Esterel-to-C compiler developed by France Telecom R&D [17]: see Section 2 for a description of the compiling method and of the internal format CP. In this context, the distribution method takes place after generating the
CP control structure of the Esterel source program. Let us consider a CP program made of p control points. Each control point i is itself made of one Boolean exe_i, one block of sequential code called “task i”, and a second Boolean pause_i. At each cycle of the program, the Booleans exe_i are tested sequentially, and for each one whose value is true, task i is executed. Each task i can update the value of the Booleans exe_{i+1} to exe_p (hence only those of the control points that come after task i in the same cycle), as well as the value of all the Booleans pause_j (hence of all the control points of the next cycle). As a result, causality is preserved. At the end of the cycle, the vector (pause_1,...,pause_p) is copied into the vector (exe_1,...,exe_p). Finally, the code of each task can only include control actions of the form if...then...else..., and hence no loop; in other words, it is purely sequential code, like in the states of an OC automaton (a sketch of this dispatch loop is given at the end of this section).

The distribution method implemented inside Saxo-RT takes as input one Esterel program and one file of distribution specifications, consisting of a partition of the inputs/outputs of the Esterel program into n subsets, one for each desired computing location [20]. The distribution algorithm first replicates the control structure of the CP program onto each of the n computing locations specified in the distribution specification file provided by the user, and second, for each control point, applies the algorithm of ocrep (see Section 4.1) to the sequential code of the corresponding task; the communication mechanism used is the same as with ocrep: FIFOs. Since this algorithm is integrated inside the Saxo-RT compiler, it can only be used through this compiler. In contrast, ocrep works as a post-processor of the Lustre and Esterel compilers, so it is independent of the compiler actually used; only the output format matters. But, unlike ocrep, Saxo-RT does not suffer from the combinatorial explosion problem.
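Here is the promised sketch of the CP execution scheme in C, with p = 3 and invented task bodies; it only illustrates the exe/pause mechanism described above, not the actual code generated by Saxo-RT. Note that task 0 may activate control point 2 within the same cycle (forward only), while pause flags take effect at the next cycle.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define P 3                              /* number of control points */
static bool exe[P] = { true, false, false };
static bool pause_flags[P];              /* the "pause" Booleans */

/* Invented task bodies: a task may set exe[j] for j > i (same cycle)
   and pause_flags[j] for any j (next cycle). */
static void task0(void) { exe[2] = true; pause_flags[1] = true; }
static void task1(void) { pause_flags[0] = true; }
static void task2(void) { printf("task 2 ran\n"); }

static void cycle(void) {
    memset(pause_flags, 0, sizeof pause_flags);
    if (exe[0]) task0();                 /* control points tested in order */
    if (exe[1]) task1();
    if (exe[2]) task2();
    memcpy(exe, pause_flags, sizeof exe); /* end of cycle: pause -> exe */
}

int main(void) {
    for (int i = 0; i < 4; i++) cycle(); /* four reactions */
    return 0;
}
```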
4.3 The Screp code distributor
The screp code distributor acts as a post-processor for the Esterel compiler. It takes as input one SC program and a file of distribution specifications, and produces as output several SC programs, one for each computing location of the distribution specifications [21]. An SC program consists of a control part, a Boolean synchronous sequential circuit, and a data part, a table of external actions allowing the manipulation of the program's variables. These actions are the same as the sequential actions of OC (see Section 4.1). A program has a set of input and output signals. Each of these can be pure or valued, in which case the signal is associated with a local variable of the corresponding type. Local variables are manipulated by the actions of the table. The sequential circuit is made of Boolean gates, registers, and special nets that trigger actions of the table. The program has a periodic behavior, and a central clock drives all the registers (this is the base clock of the program).
In the textual representation, a circuit is simply a list of numbered nets. Each net is connected in input to a simple Boolean gate, represented by its input Boolean expression: either a conjunction or a disjunction of nets or negated nets. Expressions cannot be nested, and two expressions are predefined: 0 and 1. The complete list of net kinds is the following:
• A standard net defines a net with an input Boolean expression. It allows the building of complex Boolean expressions.
• An action net drives an action defined in the action table. This action is triggered whenever the net bears the value 1. An action can be either a variable assignment with any expression on the right-hand side, or an external procedure call with possibly several variable parameters and with any expressions as value parameters.
• An ift net drives an expression test defined in the action table. This test action is triggered whenever the net takes the value 1. The ift net is assigned the result value of the test.
• An input net does two things: first it reads the input signal and sets the corresponding presence variable, and second it propagates 1 if the input signal is present, and 0 otherwise. So actually it is represented by an input part and an ift part. The ift tests the presence Boolean of the input, and it behaves exactly like the ift net above. It also sets the variable associated with the input signal when this one is valued. It is the only net with no expression, since it is always executed at each clock tick (very much like the go actions of OC).
• An output net corresponds to an output signal. It behaves like an action net, triggering the emit action whenever it bears the value 1. If the output signal is valued, then the emit action must have an expression of the output signal's type as parameter.
• A register net has a single fanin and an initial value (0 or 1).
The action-triggering nets are therefore action, ift, input, and output, while the non-triggering nets are standard and register. The semantics of this program model is based on the zero-delay assumption: the circuit is viewed as a set of Boolean equations that communicate their results to each other in zero time. Since the circuit is acyclic, the equations can be totally ordered, such that any variable depends only on previously defined ones. Then, for any valuation of the registers and of the inputs, there exists a unique solution to the set of equations, meaning that the circuit has a unique behavior [6]. Also, only causal programs are considered, meaning that any given variable can only be modified in one parallel branch of the control structure. The purpose of this causality property, which has nothing to do with the control structure itself, is only to rule out non-deterministic programs.
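The zero-delay evaluation of such a circuit can be pictured with a minimal C sketch; the three nets, the action, and all names are invented for the example. Nets are evaluated in a topological order within one clock tick, an action net fires its action when it evaluates to 1, and the register latches its fanin for the next tick.

```c
#include <stdbool.h>
#include <stdio.h>

static bool reg0 = true;            /* register net, initial value 1 */
static int  x = 0;                  /* a variable of the data part   */

static void action_incr(void) { x = x + 1; }   /* entry of the action table */

/* One clock tick: nets evaluated in topological order (zero-delay). */
static void tick(bool input_present) {
    bool n1 = input_present;         /* input net: 1 iff the signal is present */
    bool n2 = n1 && reg0;            /* standard net: conjunction              */
    if (n2) action_incr();           /* action net: fires when it bears 1      */
    reg0 = n1;                       /* register net latches its fanin         */
    printf("x = %d\n", x);
}

int main(void) {
    tick(true); tick(true); tick(false); tick(true);
    return 0;
}
```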
The principle of the screp algorithm is the following:
(i) First, one set of computing locations is assigned to each action of the circuit: each action will be executed by a single location, except the ifts, which will be computed by all locations. From this, one circuit can be obtained for each computing location:
• The data part is obtained by removing all the non-relevant actions.
• The control part is obtained by taking the original control part, changing each action and output net whose action is not assigned to the current computing location into a standard net, and changing each input net into what is called a simulation block (see Step iv below). In contrast, each ift net is replicated on all the control parts.
However, the algorithm still works on a single circuit until Step iii, when it generates one circuit for each computing location. Until then, each computing location thus has a virtual circuit.
(ii) After this first step, the virtual circuit of each computing location makes references to variables that are not computed locally and to inputs that are not received locally. Since the target architecture is a distributed-memory one, each distributed program only modifies its local variables (owner-computes rule), and therefore has to maintain a local copy of each distant variable and input, i.e., those belonging to another computing location. To achieve this, the algorithm adds communication instructions to each virtual circuit to solve the distant variable dependencies.
(iii) At this point, one actual circuit is generated for each computing location by copying each virtual circuit into a different file.
(iv) Finally, input simulation blocks are added to solve the distant input dependencies. Without entering into the details, their purpose is to prevent useless communications by sending the presence information of an input signal only to those computing locations that need it.

The communication mechanism is the same as for ocrep (see Section 4.1), namely FIFO queues, except that, between any pair of computing locations, n FIFOs are created, one for each variable of the program (instead of only 2 for ocrep). This is necessary because the control structure of an SC circuit is parallel, hence two parallel branches can both perform a send towards the same computing location. In order to guarantee that the values are not mixed up, it is necessary to identify each sent value with the corresponding variable. Concerning the implementation, Unix/Internet sockets work perfectly well, and it suffices to open two sockets instead of n per pair of computing locations and to insert the couple ⟨variable, value⟩ instead of the value only.
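A minimal C sketch of this last trick (the tag enumeration, the message layout, and the channel layer are invented, and the socket is simulated by an in-memory FIFO): a single channel carries ⟨variable, value⟩ couples, and the receiver dispatches on the tag, so that sends issued by parallel branches in any order are never mixed up.

```c
#include <stdio.h>

enum var_tag { VAR_X, VAR_Y };       /* one tag per program variable */
struct msg { enum var_tag tag; int value; };

/* A small in-memory FIFO standing in for one socket. */
static struct msg fifo[16];
static int head = 0, tail = 0;
static void send_msg(enum var_tag t, int v) { fifo[tail++ % 16] = (struct msg){ t, v }; }
static struct msg recv_msg(void) { return fifo[head++ % 16]; } /* would block if empty */

static int x_copy, y_copy;           /* local copies of the distant variables */

int main(void) {
    /* Two parallel branches of the sender may emit in any order... */
    send_msg(VAR_Y, 7);
    send_msg(VAR_X, 3);
    /* ...the receiver dispatches on the tag, so values are never mixed up. */
    for (int i = 0; i < 2; i++) {
        struct msg m = recv_msg();
        if (m.tag == VAR_X) x_copy = m.value; else y_copy = m.value;
    }
    printf("x = %d, y = %d\n", x_copy, y_copy);
    return 0;
}
```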
4.4 Compiling Esterel towards CFSMs
Besides screp, I do not know of any result bearing directly on the automatic distribution of synchronous circuits. The closest related works are found in the production of GALS from synchronous circuits, and indeed the two approaches (distribution
and production of GALS) are tightly connected. So the closest results are those of Berry and Sentovich [8]: they implement constructive synchronous circuits as a network of communicating Codesign Finite State Machines (CFSMs) inside Polis [3], which are by definition GALS systems. There are a number of differences with screp. First, the authors consider cyclic synchronous circuits, with the restriction that these cyclic circuits must be constructive [34]. A constructive circuit is a “well-behaved” cyclic circuit, meaning that there exists an acyclic circuit computing the same outputs from the same inputs. However, their synchronous circuits only manipulate Booleans. In contrast, screp only considers acyclic synchronous circuits, but these also manipulate valued variables such as integers, reals, and so on. Second, the CFSMs communicate with each other through non-blocking 1-place communication buffers, while screp uses blocking n-place FIFO queues. Third, their method for obtaining a GALS system involves partitioning the set of gates into clusters, implementing each cluster as a CFSM, and finally connecting the clusters inside a Polis network. They therefore have the possibility to choose among several granularities, ranging from one gate per cluster to a single cluster for the whole synchronous circuit. On the other hand, they do not give a method to obtain such a clustering automatically. Finally, the CFSMs communicate with each other in order to implement the constructive semantics of Esterel (this is required because their circuits can be cyclic). This means that a CFSM communicates facts about the stabilization of its local gates, so that the other CFSMs can react accordingly. The principle is that the network of CFSMs as a whole behaves exactly like the source centralized synchronous circuit. In contrast, each circuit of the GALS systems obtained with screp communicates values, and the global coherency is ensured because each circuit implements the whole control structure. As a consequence, with screp the programs of the computing locations are larger but the number of communications is much smaller. In a distributed architecture where communications are often more expensive than computations, this can represent a valuable advantage.
4.5 Deployment of Lustre programs over “time-triggered” architectures
Within the Verimag laboratory, Caspi and his colleagues have pursued research on the distribution of Lustre programs [12] and their deployment over TTA architectures (“Time-Triggered Architecture” [26]). The TTA architecture is based on the TTP communication protocol (“Time-Triggered Protocol”), which is an implementation of the TDMA model (“Time Division Multiple Access”). The principle is that each processor willing to communicate is assigned a communication window, the sequence of windows being the same at each cycle; it is thus a perfectly synchronous behavior.
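The TDMA principle can be sketched in a few lines of C (the slot assignment and frame length are invented, not taken from TTP): time is divided into a fixed, repeating sequence of windows, and a processor may send only during its own window.

```c
#include <stdio.h>

/* Invented static TDMA schedule: the same frame repeats forever. */
static const int slot_owner[4] = { 0, 1, 2, 1 };  /* window i belongs to slot_owner[i] */

static int may_send(int processor, int time) {
    return slot_owner[time % 4] == processor;     /* only in its own window */
}

int main(void) {
    for (int t = 0; t < 8; t++)
        printf("t=%d: window owner = %d, proc 2 may send: %s\n",
               t, slot_owner[t % 4], may_send(2, t) ? "yes" : "no");
    return 0;
}
```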
In this work, the authors claim that, in the automatic control industry, Simulink is a de facto standard. Thus, they propose to translate Simulink programs into Lustre, before deploying them over TTA architectures. They first identify four main differences between Lustre and Simulink: (i) discrete-time versus continuous-time semantics; (ii) unique and precise versus simulation-dependent semantics: some Simulink models are accepted if one chooses variable-step simulation, and rejected if one chooses fixed-step, auto, or multi-threaded simulation; (iii) strong and explicit types versus non-explicit types; (iv) modular versus non-modular design: the sample time of a subsystem B embedded into a subsystem A can be anything, for instance faster than the period of A. This implies six consequences for the translation of Simulink programs into Lustre: (i) only the discrete-time and non-ambiguous part of Simulink is translated; (ii) the Simulink simulation method must be solver=fixed-step+discrete and mode=auto; (iii) the Lustre program must be run at the time period at which the Simulink model was simulated; (iv) the Simulink model must have the Boolean logic signals flag on; (v) Simulink models with algebraic loops are rejected; (vi) the Simulink hierarchy should be preserved.

Then the authors propose four extensions of Lustre to allow the generation of distributed code. These extensions are annotations, which can be: (i) location=P to specify where a block should be executed; (ii) (hyp) basic period=p to specify the basic period of a block; periodic clocks can be defined with the instruction periodic cl(k,p); (iii) (hyp) exec time(A) in [l,u] to specify that the WCET of A is between l and u; (iv) (req) date(y)-date(x)