Periodic Schedules for Cyclo-Static Dataflow

Bruno Bodin

Alix Munier-Kordon

Benoît Dupont de Dinechin

KALRAY SA 445 rue Lavoisier 38330 Montbonnot, FRANCE Email: [email protected]

LIP6-UPMC Place Jussieu 75005 Paris, FRANCE Email: [email protected]

KALRAY SA 445 rue Lavoisier 38330 Montbonnot, FRANCE Email: [email protected]

Abstract: The Cyclo-Static Dataflow Graph (CSDFG in short) is a static model commonly used to describe communications between processes. It is increasingly considered for modeling applications executed by many-core architectures; its static analysis thus becomes essential for developing efficient compile-time optimizations. This paper aims to develop efficient algorithms to approximately solve two difficult problems: the determination of the maximum throughput of a CSDFG and the optimization of the buffer sizes under a minimum required throughput. Both are based on a new characterization of feasible periodic schedules. A polynomial-time algorithm is deduced to evaluate the maximum throughput of a periodic schedule, providing a lower bound on the maximum throughput of the CSDFG. A new model based on integer linear programming is also developed for the optimization of the buffer sizes under a minimum required throughput, leading to a new algorithm that solves it approximately. Our algorithms are successfully compared with other academic solutions through representative benchmarks.

I. INTRODUCTION

The Synchronous Dataflow Graph [20] (SDFG in short) is a popular formalism that has been used for many years in the field of embedded system design such as Digital Signal Processing. However, this model is too coarse to reflect the communications between actors in many applications. Bilsen et al. [8] introduced Cyclo-Static Dataflow Graphs (CSDFGs in short) which is considered to be a more accurate model to address this problem. Actors’ exchanges are then more detailed. While SDFG assumes one schedule per actor, CSDFG actors iterate through a set of pre-defined phases. The consequence is that their analysis results are more pertinent and closer to what actually happens in real-life applications. CSDFGs are considered in many areas to model data exchanges between an application’s processes. They are usually automatically extracted from a suitable description of the application. In the field of synchronous languages, Mandel et al. developed a language, Lucy-n, that handles processes of different rates communicating through buffers [22]. The intermediate representation of this extended language is comparable to CSDFGs. CSDFGs are also considered to model embedded applications in order to map them on a parallel architecture. Several studies have been performed in an academic context [1], [2], [28]. Another example is the dataflow compiler designed to map a CSDFG on the Massively Parallel Processor Array (MPPA in short) developed by the Kalray company [18] that embeds 256 processors on a 28nm chip.

CSDFGs have many advantages. They are deterministic (as they can be seen as a subclass of Kahn networks [17]) and static in the sense that the volume of the communications is repetitive and fixed. While this staticity may prevent the modeling of any dynamic behavior, the benefit is that their associated basic decision problems are decidable. However, no polynomial-time algorithm is known so far for checking liveness, nor for computing the maximum throughput of a schedule.

The computation of the maximum throughput of an SDFG (and thus a CSDFG) is a fundamental problem. In view of the importance of its practical applications, many authors have addressed this problem. Roughly speaking, two classes of methods exist: exact ones [12], [13], [24], which provide the (exact) maximum value of the throughput in exponential time, and approximate ones, which evaluate the maximum throughput by only considering a subset of the solutions such as the periodic schedules [3]. The particularity of these periodic schedules is that they impose strictly periodic executions of tasks.

This paper characterizes any feasible periodic schedule of a CSDFG with a polynomial set of linear equations, by extending the approach provided by [3], which is limited to SDFGs. A polynomial-time algorithm is derived to evaluate a lower bound of the maximum throughput. Its performance (execution time and quality of the bound) is experimentally compared with SDF3 [25] on an industrial benchmark. Although the use of a lower bound may sound insufficient, it often remains appreciated in the absence of an exact solution (due to high computation time) or in design space exploration.

Several authors have noticed that periodic schedules make it possible to model the minimization of the total buffer size under a minimum throughput requirement using integer linear programming for SDFGs [29], [4] or CSDFGs [30], [5], [6].
The main advantage of this model is that it efficiently prunes the domain of exploration while keeping a relevant set of solutions. Our original characterization of periodic schedules of a CSDFG can thus be used to derive a new algorithm to approximately solve this optimization problem. Our approach is tested versus several other ones (namely Stuijk et al. [26], Benazouz et al. [6]). The present paper is organized as follows. Section II is dedicated to the presentation of CSDFG and of the problems considered. Related works are presented in Section III. The characterization of a periodic schedule of a CSDFG is developed in Section IV. Section V introduces the derived algorithms that are experimentally tested in Section VI. Section

VII is our conclusion.

II. SYNTAX AND PROBLEM DEFINITION

This section is dedicated to a description of the problems considered and some important definitions. The CSDFG model is first reviewed in Subsection II-A, followed by the definition of a necessary condition of liveness (namely, consistency) in Subsection II-B. Subsection II-C concerns some definitions about scheduling. The two problems addressed in this paper are then explicitly formulated in Subsection II-D.

A. Cyclo-Static Dataflow Graphs

A Cyclo-Static Dataflow Graph (CSDFG) is a directed graph where nodes model macro-tasks, and arcs correspond to buffers. It is denoted by $G = (T, A)$ where $T$ (resp. $A$) is the set of nodes (resp. arcs).

1) Actors: Every actor (or macro-task) $t \in T$ is decomposed into $\varphi(t) \in \mathbb{N} - \{0\}$ phases; for every value $k \in \{1, \cdots, \varphi(t)\}$, the $k$th phase of $t$ is denoted by $t_k$ and has a constant duration $d(t_k) \in \mathbb{R}$. One iteration of the actor $t \in T$ corresponds to the ordered executions of the phases $t_1, \cdots, t_{\varphi(t)}$.

Moreover, every actor t ∈ T is executed several times: for every integer n ∈ N−{0} and for every phase k ∈ {1, ⋯, ϕ(t)}, ⟨tk , n⟩ denotes the nth execution of the kth phase of t.

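For every couple $(k, n) \in \{1, \cdots, \varphi(t)\} \times (\mathbb{N} - \{0\})$, $Pr\langle t_k, n\rangle$ is the preceding execution phase of $\langle t_k, n\rangle$. More formally,

$$Pr\langle t_k, n\rangle = \begin{cases} \langle t_{k-1}, n\rangle & \text{if } k > 1, \\ \langle t_{\varphi(t)}, n-1\rangle & \text{if } k = 1. \end{cases}$$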

The execution ⟨tϕ(t) , 0⟩ is fictitious and is only introduced to simplify the definition of P r. A Synchronous Dataflow Graph can be seen as a special case of CSDFG where each actor t ∈ T has only one phase: ∀t ∈ T , ϕ(t) = 1.

2) Buffers: Every arc $a = (t, t') \in A$ represents a buffer $b(a)$ of unbounded size from the actor $t$ to $t'$ with an initial number of stored data, $M_0(a) \in \mathbb{N}$. It is supposed that $\forall k \in \{1, \cdots, \varphi(t)\}$, $in_a(k)$ data are written in $b(a)$ at the end of an execution of $t_k$. Similarly, $\forall k' \in \{1, \cdots, \varphi(t')\}$, $out_a(k')$ data are read from $b(a)$ before the execution of $t'_{k'}$.

Figure 1 shows an arc $a$ between the two actors $t$ and $t'$. The respective numbers of phases of the two actors are $\varphi(t) = 3$ and $\varphi(t') = 2$. The two associated vectors of $a$ and its initial number of tokens are respectively $in_a = [2, 3, 1]$, $out_a = [2, 5]$, and $M_0(a) = 0$. Moreover, actor $t$ is a stateful actor, which means that two phases or two successive iterations of this macro-task cannot overlap. This is modeled by the loopback arc $g$.

Figure 1. A simple arc a between two actors t and t′ with a loopback arc g on task t.

Now, let us define $I_a\langle t_k, n\rangle$ as the total number of data produced by $t$ in the buffer $b(a)$ at the completion of $\langle t_k, n\rangle$. It can be recursively defined as:

$$I_a\langle t_k, n\rangle = I_a Pr\langle t_k, n\rangle + in_a(k)$$

with the initialization value $I_a\langle t_{\varphi(t)}, 0\rangle = 0$. Similarly, the number of data consumed by $t'$ in the buffer $b(a)$ at the completion of $\langle t'_{k'}, n'\rangle$ is defined by the sequence

$$O_a\langle t'_{k'}, n'\rangle = O_a Pr\langle t'_{k'}, n'\rangle + out_a(k')$$

with the initialization $O_a\langle t'_{\varphi(t')}, 0\rangle = 0$.

For the sake of clarity, we also define the two integer values $i_a = I_a\langle t_{\varphi(t)}, 1\rangle$ and $o_a = O_a\langle t'_{\varphi(t')}, 1\rangle$, equal to the number of tokens respectively produced and consumed in the buffer $b(a)$ during one entire iteration of actors $t$ and $t'$.

The total number of tokens contained in a buffer must remain non-negative, that is, any execution $\langle t'_{k'}, n'\rangle$ can be done at the completion of $\langle t_k, n\rangle$ if and only if $M_0(a) + I_a\langle t_k, n\rangle - O_a\langle t'_{k'}, n'\rangle \ge 0$. For example, considering the CSDFG pictured in Figure 1, the execution $\langle t'_2, 1\rangle$ can be done at the completion of $\langle t_1, 2\rangle$ because $M_0(a) + I_a\langle t_1, 2\rangle - O_a\langle t'_2, 1\rangle = 0 + 8 - 7 \ge 0$.

A CSDFG models buffers with an unlimited size, i.e. the number of data stored simultaneously in $b(a)$ may be unbounded. This hypothesis is clearly unacceptable for real-life systems. Stuijk et al. [26] noticed that a buffer $b(a)$ with a bounded size from $t$ to $t'$ may be modeled by adding a reverse arc $a' = (t', t)$ in the associated CSDFG with $\forall k \in \{1, \cdots, \varphi(t)\}$, $out_{a'}(k) = in_a(k)$ and $\forall k' \in \{1, \cdots, \varphi(t')\}$, $in_{a'}(k') = out_a(k')$. The size of the buffer is then equal to $M_0(a) + M_0(a')$. $Fb(A)$ represents the set of all the reverse arcs associated with $A$. Figure 2 illustrates the transformation.

Figure 2. A bounded buffer b(a) between the two actors t and t′. M0(a) is the initial number of data stored in b(a) and M0(a′) models the available space in this buffer; M0(a) + M0(a′) is then its total size.

Without loss of generality, it is assumed that the application is modeled using a connected CSDFG. Now, if all buffers have a bounded size, the graph obtained by adding reverse arcs is strongly connected (i.e. for every couple of macro-tasks $(t, t') \in T^2$, there exists a path from $t$ to $t'$).

B. Consistency of a CSDFG

Consistency is a necessary (non sufficient) condition for the existence of a valid schedule within bounded memory that was established first for SDFGs [20]. It was extended to CSDFGs [8] by considering the cumulative number of tokens produced/consumed by one iteration of its actors.

Let us consider the pre-post $|A| \times |T|$ matrix $\Gamma$ associated with a CSDFG $G$, defined by

$$\Gamma_{at} = \begin{cases} i_a & \text{if } a = (t, t'),\ t' \in T, \\ -o_a & \text{if } a = (t', t),\ t' \in T, \\ 0 & \text{otherwise.} \end{cases}$$

The CSDFG is said to be consistent if the rank of $\Gamma$ is $|T| - 1$. Furthermore, any consistent CSDFG has a repetition vector $q \in (\mathbb{N} - \{0\})^{|T|}$ such that $\forall a = (t, t') \in A$, $q_t \times i_a = q_{t'} \times o_a$.

The repetition vector defines the number of task firings in a sequence that preserves token quantities in each buffer. As an example, the repetition vector of the CSDFG presented in Figure 3 is q = [3, 4, 6].

Figure 3. A CSDFG with three actors and arcs. Actors’ durations are d(A) = [3, 1], d(B) = [2, 1, 2] and d(C) = [1]. Rate vectors appearing in the figure: [1,3], [2], [3,5], [6], [6,2,1], [1,1,4]; initial markings: 4, 0, 0.

Now, as consistency is a necessary condition of liveness with bounded buffers, our study focuses on consistent CSDFGs.
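The balance equations $q_t \times i_a = q_{t'} \times o_a$ can be solved by propagating rational ratios along the arcs. The following Python sketch illustrates this under stated assumptions: the function name `repetition_vector` is hypothetical, and the per-iteration rates and arc directions used in the example are inferred from the rate vectors surviving from Figure 3 (they are not confirmed by the text). With these assumed rates, the sketch recovers q = [3, 4, 6].

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, arcs):
    """Smallest positive integer q with q[src] * i_a == q[dst] * o_a on every arc.

    arcs is a list of (src, dst, i_a, o_a), where i_a and o_a are the tokens
    produced and consumed during one full iteration of src and dst.
    """
    ratio = {actors[0]: Fraction(1)}
    pending = [actors[0]]
    while pending:
        t = pending.pop()
        for src, dst, i_a, o_a in arcs:
            if src == t and dst not in ratio:
                ratio[dst] = ratio[t] * Fraction(i_a, o_a)
                pending.append(dst)
            elif dst == t and src not in ratio:
                ratio[src] = ratio[t] * Fraction(o_a, i_a)
                pending.append(src)
    if len(ratio) != len(actors):
        raise ValueError("the graph is not connected")
    # Every arc must satisfy its balance equation, otherwise G is inconsistent.
    for src, dst, i_a, o_a in arcs:
        if ratio[src] * i_a != ratio[dst] * o_a:
            raise ValueError("inconsistent CSDFG")
    # Scale the rational solution to the smallest positive integer vector.
    scale = reduce(lambda x, y: x * y // gcd(x, y),
                   (r.denominator for r in ratio.values()), 1)
    ints = {t: int(r * scale) for t, r in ratio.items()}
    common = reduce(gcd, ints.values())
    return {t: v // common for t, v in ints.items()}


# Assumed reading of Figure 3: sums of the rate vectors [3,5] and [1,1,4],
# [6,2,1] and [6], [2] and [1,3]; arc directions are a guess.
figure3_arcs = [("A", "B", 8, 6), ("B", "C", 9, 6), ("C", "A", 2, 4)]
q = repetition_vector(["A", "B", "C"], figure3_arcs)
print(q)
```

Exact rational arithmetic is used so that the consistency check does not depend on floating-point rounding.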

The throughput of an actor $t \in T$ associated with a schedule $S$ is usually defined as

$$Th^S_t = \lim_{n \to \infty} \frac{n}{S\langle t_1, n\rangle}.$$

Theorem 1, proved by Stuijk et al. [26], characterizes the relations between the throughputs of different actors using the repetition vector.

Theorem 1 ([26]): Let $G = (T, A)$ be a consistent strongly connected CSDFG and $S$ a valid schedule. For any couple of tasks $(t, t') \in T^2$, $\frac{Th^S_t}{q_t} = \frac{Th^S_{t'}}{q_{t'}}$, where $q$ is the repetition vector of $G$.

The throughput of a valid schedule $S$ is then equal to $Th^S_G = \frac{Th^S_t}{q_t}$ for any actor $t \in T$.

The most common scheduling policy consists of executing the actors as soon as possible. Figure 4 presents the first executions of the as soon as possible schedule for the CSDFG pictured in Figure 3.

A schedule $S$ is called periodic if every actor $t \in T$ is associated with a period $\mu^S_t \in \mathbb{R}^+ - \{0\}$ such that $\forall k \in \{1, \cdots, \varphi(t)\}$, $S\langle t_k, n\rangle = S\langle t_k, 1\rangle + (n - 1)\mu^S_t$.

Periodic scheduling thus has additional constraints that make it less general and so easier to define. Indeed, note that in this case only the first $\varphi(t)$ starting times of actor $t \in T$ have to be defined. Moreover, the throughput of a periodic actor $t$ is then exactly $Th^S_t = \frac{1}{\mu^S_t}$. The throughput of the schedule is then equal to $Th^S_G = \frac{1}{\mu^S_t q_t}$ for any actor $t \in T$. The period of a periodic schedule $S$ is then defined as $\Omega^S_G = \frac{1}{Th^S_G} = \mu^S_t q_t$.
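As a quick numerical check of the relation $\Omega^S_G = \mu^S_t q_t$, the sketch below uses the actor periods reported for the periodic schedule of Figure 5 (8, 6 and 4) together with the repetition vector q = [3, 4, 6] of Figure 3. The helper name `schedule_period` is hypothetical and only serves the illustration.

```python
from fractions import Fraction


def schedule_period(periods, repetition):
    """Return Omega = mu_t * q_t after checking it is identical for all actors."""
    products = {t: Fraction(periods[t]) * repetition[t] for t in periods}
    values = set(products.values())
    if len(values) != 1:
        raise ValueError(f"actor periods do not agree: {products}")
    return values.pop()


mu = {"A": 8, "B": 6, "C": 4}   # actor periods of the Figure 5 schedule
q = {"A": 3, "B": 4, "C": 6}    # repetition vector of Figure 3
omega = schedule_period(mu, q)  # period of the schedule
throughput = 1 / omega          # Th = 1 / Omega
print(omega, throughput)
```

A set of actor periods whose products $\mu^S_t q_t$ disagree cannot come from a periodic schedule of a consistent strongly connected graph, which is why the helper raises an error in that case.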


A feasible (or valid) schedule associated with a CSDFG is a function S that associates, for every triple (t, k, n) with t ∈ T , k ∈ {1, ⋯, ϕ(t)} and n ∈ N − {0}, a starting time S⟨tk , n⟩ ∈ R for the nth execution of tk such that the number of data in every buffer a ∈ A remains non negative, i.e. no data is read before it is produced.


The as soon as possible schedule maximizes the throughput. Nevertheless, its description may be of exponential size as it depends on the repetition vector rather than the problem size. The consequence is that other scheduling policies must be considered to reduce the computation time of the maximum throughput.


C. Schedules of CSDFGs


Figure 4. An as soon as possible feasible schedule for the CSDFG pictured in Figure 3.

Figure 5. A periodic feasible schedule for the CSDFG pictured in Figure 3. Highlighted executions correspond to initial starting times; the others are derived from the period $\mu^S_t$.

Figure 5 presents a periodic schedule for the CSDFG from Figure 3. Observe that $\mu^S_A = 8$, $\mu^S_B = 6$ and $\mu^S_C = 4$, and thus $\mu^S_A \times q_A = \mu^S_B \times q_B = \mu^S_C \times q_C = 24$. The throughput and the period of the schedule $S$ are respectively $Th^S_G = \frac{1}{24}$ and $\Omega^S_G = 24$.

The throughput of a CSDFG is maximized by the as soon as possible scheduling policy. This is usually no longer true for periodic schedules, i.e. the maximum throughput of a periodic schedule can be seen as a lower bound of the maximum reachable throughput.

D. Problem formulation

The theoretical aim of this paper is to characterize the structure of any feasible periodic schedule for a CSDFG. This work will be considered to efficiently solve two practical problems, namely the evaluation of the maximum throughput of a CSDFG and the minimization of the buffer sizes under a throughput constraint.

1) Evaluation of the maximum throughput: Let’s consider a strongly connected CSDFG $G = (T, A)$. The first problem addressed by the present paper is to evaluate the maximum reachable throughput of $G$.

2) Minimizing buffers under a throughput constraint: Let’s consider a connected CSDFG $G = (T, A)$. Each arc $a = (t, t') \in A$ is associated with a feedback arc $a' = (t', t) \in Fb(A)$; the size of the corresponding buffer $b(a)$ equals $M_0(a) + M_0(a')$.

Setting θ(a) = θ(a′ ) to be the size of a data stored in b(a), the size of this buffer is exactly M0 (a)θ(a) + M0 (a′ )θ(a′ ). The whole size of G is thus ∑a∈(A∪F b(A)) θ(a)M0 (a).

The optimization problem considered may be stated as follows: Let $Th^\star_G$ be the minimum throughput required. The problem consists of finding integer values $M_0(a')$ for $a' \in Fb(A)$ such that

1) The throughput of the CSDFG is at least $Th^\star_G$;
2) The whole buffer size is minimum.

III. RELATED WORK

The throughput evaluation of an SDFG (or a CSDFG) within a reasonable time is a difficult question from a theoretical point of view. This question has long been solved for Homogeneous SDFGs (for which $o_a = i_a = 1$ for any $a \in A$) [10]. One can note that, in this case, periodic schedules reach the maximum throughput [23].

Several exact approaches of exponential time were developed for general SDFGs or CSDFGs. Ghamarian et al. [13] and Stuijk et al. [26] proposed to perform a symbolic execution until reaching an already known state. Lee et al. [20] showed that any SDFG may be transformed into an equivalent Homogeneous one. The problem can then be solved on this transformed graph using a polynomial-time algorithm (but with an HSDFG structure of exponential size). An exact method based on Max-plus algebra was recently proposed by de Groote et al. [12] for SDFGs and seems to be much faster than previous ones. Nevertheless, to our knowledge, its extension to CSDFGs is not available at this time.

Another way to compute the throughput consists of limiting the study to periodic schedules in order to get a polynomial-time algorithm. Benabid et al. [3] proved that any periodic schedule of an SDFG may be characterized using $\Theta(|A|)$ linear equations on the starting times of the first execution of the actors. It is also proved that the throughput achieved with this scheduling policy remains a lower bound of the maximum throughput. A first periodic approach for throughput evaluation of CSDFGs was provided by Bamakhrama et al. [2]. They restricted their study to the special case of acyclic graphs, with no consideration for the buffer size. Our approach can be seen as the theoretical extension to CSDFGs of the work of Benabid et al. [3]. General graph structures with bounded or unbounded buffers can be considered, and thus any real-life application modeled using a CSDFG.
The minimization of the buffer sizes under a throughput constraint has been intensively studied because of the importance of its applications. This problem is known to be NP-hard even for Homogeneous SDFGs [7]. A first exact method based on the exploration of all possible buffer sizes was developed and tested by Stuijk et al. in [26] for CSDFGs. This exact method is unfortunately not applicable to most of the industrial problems that we tested because of its time complexity; the throughput evaluation used is itself an exponential algorithm.

Another way consists of limiting the considered schedules to periodic ones in order to simplify the optimization problem. Several authors considered subclasses of periodic schedules to obtain a number of linear constraints independent of the number of phases. Wiggers et al. [30] observed that a subclass of periodic schedules can be characterized by linear equations on the start times of actors’ first iterations. Periodic sequences of execution are defined a priori for each actor; then, by computing a valid periodic schedule with these execution sequences, buffer sizes are derived algorithmically. The overall algorithm is of polynomial complexity. Benazouz et al. [5] observed that, when the phases’ start times are fixed in this way, the entire problem can be defined as a linear program. Different phase scheduling policies are then proposed (namely the Burst policy, for which all the phases of an actor are executed without interruption, or the Average policy, for which the phases are distributed evenly over a period). More recently, Benazouz et al. [6] proved that the determination of the phases’ execution policies may also be modeled using linear programs, leading to a two-step method (called the Min-max algorithm in the following). The Min-max algorithm showed better performance on different test sets. Nevertheless, the computation time of the proposed policy is of quadratic complexity, in contrast to the one proposed by Wiggers et al. [30], which is linear.
The methodology presented in the present paper differs from these previous approaches essentially in that there is no longer any scheduling policy: the starting times of phase executions are part of the problem and are therefore not defined a priori. Our method considers a larger set of feasible schedules, since the schedule of the phases of any macro-task is not fixed beforehand. Actually, we now consider all the existing periodic schedules. The main advantage is that the expected solutions are better, since the set of feasible solutions is bigger. The equations are simpler, and the method is composed of only one step. The only drawback is that the number of equations now depends on the total number of phases $\sum_{t \in T} \varphi(t)$, which may be high for real-life problems. The execution time of our method is thus expected to be much longer. Since our goal is geared more towards an efficient result, in the experiments we compared our methods with the Min-max algorithm [6] and the exact method from SDF3 [26].

IV. FEASIBLE PERIODIC SCHEDULE

This section is devoted to the characterization of the set of feasible periodic schedules of a strongly connected CSDFG. Our algorithms developed to practically solve the two considered problems are based on the analytical equations obtained. Subsection IV-A reviews a first analytical definition of the precedence constraints associated with any arc of a CSDFG. An original description of the set of the precedence constraints is exhibited in Subsection IV-B; then Subsection IV-C provides an analytical description of any feasible periodic schedule.

A. Definition of precedence constraints

The infinite set of constraints induced by an arc $a = (t, t')$ on the phase executions of macro-tasks $t$ and $t'$ can be expressed as classical precedence constraints. More formally, $a$ is said to induce a precedence constraint from $\langle t_k, n\rangle$ to $\langle t'_{k'}, n'\rangle$ if the two following conditions hold:

1) $\langle t'_{k'}, n'\rangle$ can only be executed at the completion of $\langle t_k, n\rangle$;
2) $Pr\langle t'_{k'}, n'\rangle$ can be executed before the end of $\langle t_k, n\rangle$.

Let us consider the arc a presented in Figure 1 to illustrate this point. Since M0 (a) + Ia ⟨t1 , 2⟩ − Oa ⟨t′ 2 , 1⟩ ≥ 0, ⟨t′ 2 , 1⟩ can be executed at the completion of ⟨t1 , 2⟩. Now, M0 (a) + Ia ⟨t3 , 1⟩ − Oa ⟨t′ 2 , 1⟩ = 0 + 6 − 7 < 0. The consequence is that ⟨t′ 2 , 1⟩ must wait for the completion of ⟨t1 , 2⟩, and thus can only be executed at its completion (Condition 1). Now, as M0 (a) + Ia ⟨t3 , 1⟩ − Oa ⟨t′ 1 , 1⟩ = 0 + 6 − 2 ≥ 0, ⟨t′ 1 , 1⟩ can be executed before the end of ⟨t1 , 2⟩ (Condition 2). The existence of a precedence relation from ⟨t1 , 2⟩ to ⟨t′ 2 , 1⟩ is deduced.
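The two conditions of the definition can be evaluated directly from the cumulative token counts. The sketch below is one possible reading, not the authors' implementation: the function names are hypothetical, and Condition 1 is interpreted as "enough tokens at the completion of $\langle t_k, n\rangle$ but not at the completion of $Pr\langle t_k, n\rangle$", which is how the worked example on the arc of Figure 1 proceeds.

```python
def cumulative(rates, k, n):
    """Tokens moved once phase k (1-based) of iteration n has completed.

    Gives 0 for the fictitious execution (k, n) = (len(rates), 0).
    """
    return (n - 1) * sum(rates) + sum(rates[:k])


def pred(rates, k, n):
    """Pr<t_k, n>: the preceding phase execution, as a (k, n) pair."""
    return (k - 1, n) if k > 1 else (len(rates), n - 1)


def induces_precedence(in_a, out_a, m0, k, n, k2, n2):
    """True if the arc induces a precedence constraint from <t_k, n> to <t'_k2, n2>."""
    produced = cumulative(in_a, k, n)
    produced_before = cumulative(in_a, *pred(in_a, k, n))
    consumed = cumulative(out_a, k2, n2)
    consumed_before = cumulative(out_a, *pred(out_a, k2, n2))
    # Condition 1: <t'_k2, n2> has enough tokens at the completion of <t_k, n>,
    # but not at the completion of the preceding producer phase.
    cond1 = (m0 + produced - consumed >= 0
             and m0 + produced_before - consumed < 0)
    # Condition 2: Pr<t'_k2, n2> can run before the end of <t_k, n>.
    cond2 = m0 + produced_before - consumed_before >= 0
    return cond1 and cond2


in_a, out_a, m0 = [2, 3, 1], [2, 5], 0  # arc a of Figure 1
print(cumulative(in_a, 1, 2), cumulative(out_a, 2, 1))  # I_a<t_1,2>, O_a<t'_2,1>
print(induces_precedence(in_a, out_a, m0, 1, 2, 2, 1))
```

On the example, the counts are 8 and 7, so the consumer has one token to spare at the completion of $\langle t_1, 2\rangle$ and would lack one at the completion of $\langle t_3, 1\rangle$.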

The following lemma, initially proved by Benazouz et al. in [5], provides a mathematical criterion that expresses this intuitive definition of a precedence constraint between two executions.

Lemma 1 ([5]): Let $a = (t, t') \in A$, a couple $(k, k') \in \{1, \cdots, \varphi(t)\} \times \{1, \cdots, \varphi(t')\}$ and the integer value $H_a(k, k') = \max\{0, in_a(k) - out_a(k')\}$. For any couple $(n, n') \in (\mathbb{N} - \{0\})^2$, there exists a precedence constraint from $\langle t_k, n\rangle$ to $\langle t'_{k'}, n'\rangle$ if and only if

$$in_a(k) > M_0(a) + I_a\langle t_k, n\rangle - O_a\langle t'_{k'}, n'\rangle \ge H_a(k, k').$$

We get for our previous example $H_a(1, 2) = \max\{0, 2 - 5\} = 0$ and $M_0(a) + I_a\langle t_1, 2\rangle - O_a\langle t'_2, 1\rangle = 1$. The inequality $2 > 1 \ge 0$ is then checked.

B. Characterization of precedence constraints

For any couple of values $(\alpha, \gamma) \in \mathbb{Z} \times (\mathbb{N} - \{0\})$, let’s consider $\lfloor\alpha\rfloor^{\gamma}$ and $\lceil\alpha\rceil^{\gamma}$ defined as:

$$\lfloor\alpha\rfloor^{\gamma} = \left\lfloor\frac{\alpha}{\gamma}\right\rfloor \times \gamma \quad \text{and} \quad \lceil\alpha\rceil^{\gamma} = \left\lceil\frac{\alpha}{\gamma}\right\rceil \times \gamma.$$

We also note, for every arc $a = (t, t') \in A$, $gcd_a = \gcd(i_a, o_a)$. The following technical lemma expresses a basic property of the structure of the precedence constraints:

Lemma 2: Let us consider an arc $a = (t, t') \in A$, a couple $(k, k') \in \{1, \cdots, \varphi(t)\} \times \{1, \cdots, \varphi(t')\}$ and two executions $\langle t_k, n\rangle$ and $\langle t'_{k'}, n'\rangle$ with $(n, n') \in (\mathbb{N} - \{0\})^2$. Let us also note $\alpha = (n - 1) \times i_a - (n' - 1) \times o_a$. If $a$ induces a precedence constraint from $\langle t_k, n\rangle$ to $\langle t'_{k'}, n'\rangle$ then $\alpha^{min}_a(k, k') \le \alpha \le \alpha^{max}_a(k, k')$ with

$$\alpha^{min}_a(k, k') = \lceil H_a(k, k') + O_a\langle t'_{k'}, 1\rangle - I_a\langle t_k, 1\rangle - M_0(a)\rceil^{gcd_a}$$

and

$$\alpha^{max}_a(k, k') = \lfloor O_a\langle t'_{k'}, 1\rangle - I_a Pr\langle t_k, 1\rangle - M_0(a) - 1\rfloor^{gcd_a}.$$

Proof: By Lemma 1, the existence of a precedence constraint from $\langle t_k, n\rangle$ to $\langle t'_{k'}, n'\rangle$ is equivalent to

$$in_a(k) > M_0(a) + I_a\langle t_k, n\rangle - O_a\langle t'_{k'}, n'\rangle \ge H_a(k, k').$$

Now, since $I_a\langle t_k, n\rangle = (n - 1) i_a + I_a\langle t_k, 1\rangle$ and $O_a\langle t'_{k'}, n'\rangle = (n' - 1) o_a + O_a\langle t'_{k'}, 1\rangle$, we get that

$$I_a\langle t_k, n\rangle - O_a\langle t'_{k'}, n'\rangle = \alpha + I_a\langle t_k, 1\rangle - O_a\langle t'_{k'}, 1\rangle.$$

1) The left inequality then becomes

$$O_a\langle t'_{k'}, 1\rangle - (I_a\langle t_k, 1\rangle - in_a(k)) - M_0(a) - 1 \ge \alpha.$$

Now, since $I_a\langle t_k, 1\rangle = I_a Pr\langle t_k, 1\rangle + in_a(k)$, we get that

$$O_a\langle t'_{k'}, 1\rangle - I_a Pr\langle t_k, 1\rangle - M_0(a) - 1 \ge \alpha.$$

As $\alpha$ is divisible by $gcd_a$, the inequality $\alpha^{max}_a(k, k') \ge \alpha$ holds.

2) Similarly, the right inequality becomes

$$\alpha \ge H_a(k, k') + O_a\langle t'_{k'}, 1\rangle - I_a\langle t_k, 1\rangle - M_0(a),$$

and thus $\alpha \ge \alpha^{min}_a(k, k')$, which concludes the proof.

A simple consequence of Lemma 2 is that, if $\alpha^{min}_a(k, k') > \alpha^{max}_a(k, k')$, there is no couple of integers $(n, n')$ such that $a$ induces a precedence constraint from $\langle t_k, n\rangle$ to $\langle t'_{k'}, n'\rangle$.

Let’s consider, as an example, the arc $a = (A, B)$ pictured in Figure 3. As $\alpha^{min}_a(1, 2) = 2$ and $\alpha^{max}_a(1, 2) = 0$, then $\alpha^{min}_a(1, 2) > \alpha^{max}_a(1, 2)$ and there are no precedence constraints induced by $a$ between $\langle t_1, n\rangle$ and $\langle t'_2, n'\rangle$ regardless of the values for $n$ and $n'$.

Lemma 3 is the converse of Lemma 2.

Lemma 3: Let us consider an arc $a = (t, t') \in A$, and a couple $(k, k') \in \{1, \cdots, \varphi(t)\} \times \{1, \cdots, \varphi(t')\}$ with $\alpha^{min}_a(k, k') \le \alpha^{max}_a(k, k')$. There exists an infinite number of couples $(n, n') \in (\mathbb{N} - \{0\})^2$ such that $a$ induces a precedence constraint from $\langle t_k, n\rangle$ to $\langle t'_{k'}, n'\rangle$.

Proof: By Bezout, there exists $(x, y) \in (\mathbb{Z} - \{0\})^2$ such that $x \times i_a - y \times o_a = gcd_a$. Now, for any integer $z$, the values

$$n(z) = 1 + x \times \frac{\alpha^{max}_a(k, k')}{gcd_a} + z \times o_a \quad \text{and} \quad n'(z) = 1 + y \times \frac{\alpha^{max}_a(k, k')}{gcd_a} + z \times i_a$$

are such that

$$(n(z) - 1) i_a - (n'(z) - 1) o_a = \frac{x \times i_a - y \times o_a}{gcd_a} \, \alpha^{max}_a(k, k') = \alpha^{max}_a(k, k').$$

For $z$ sufficiently large, $n(z)$ and $n'(z)$ are positive. Let $A(z) = M_0(a) + I_a\langle t_k, n(z)\rangle - O_a\langle t'_{k'}, n'(z)\rangle$. We prove that $H_a(k, k') \le A(z) < in_a(k)$. Indeed,

• $A(z) = M_0(a) + I_a\langle t_k, 1\rangle - O_a\langle t'_{k'}, 1\rangle + (n(z) - 1) i_a - (n'(z) - 1) o_a = M_0(a) + I_a\langle t_k, 1\rangle - O_a\langle t'_{k'}, 1\rangle + \alpha^{max}_a(k, k')$.

By definition of $\alpha^{max}_a(k, k')$,

$$\alpha^{max}_a(k, k') < -M_0(a) - I_a Pr\langle t_k, 1\rangle + O_a\langle t'_{k'}, 1\rangle.$$

Thus, A(z)

N/S : There is no periodic solution.

The lower part of Table II lists applications with buffers sized by using a greedy algorithm [7] which ensures liveness. Computation times using SDF3 increase dangerously. Even if SDF3 can still be considered to precisely evaluate the throughput for some isolated cases, it cannot be safely used in

C. Throughput-Buffering Trade-off

This section aims to test our method for the minimization of the buffer sizes. A minimum fixed throughput is usually required. The method is tested against the Min-max algorithm [6]. The PSizing-LP, PSizing-ILP and Min-max methods are first tested for different minimum throughput values. They are also compared through the computation of Pareto fronts (throughput/total buffer size) for JPEG2000. We no longer compare our methods with SDF3 or other methods. The motivation for this choice is that the Min-max algorithm is, to our knowledge, the only algorithm comparable (in terms of time complexity, throughput evaluation and buffer size) to our approach. Results concerning SDF3 are not reported for either experiment since no results were obtained within a reasonable time for any application tested (more than 24 hours for the buffer sizing computation, more than a week for the Pareto front of JPEG2000).

1) Buffer Sizing for Throughput requirement: Table III summarizes our experiments. The first two columns show the performance of the approximate algorithms Min-max and PSizing-LP, respectively. The last column concerns the optimal integer solutions for PSizing-ILP.

Nevertheless, we observe that PSizing-LP provides much more accurate results for the buffer sizing in most cases. Because it fixes a locally optimal phase scheduling policy, Min-max is not able to process the H264 Encoder application, even though a periodic schedule exists (and is constructed by our algorithm). Lastly, by comparing the last two columns, the difference between the optimal solutions of PSizing-LP and PSizing-ILP is always less than 10 percent. The consequence is that the relaxation is particularly effective with regard to computation times.

2) Throughput-Buffering Trade-off Exploration: This experiment deals with the construction of a Pareto front for JPEG2000. The generation of an approximate Pareto front is possible using the Min-max, PSizing-LP and PSizing-ILP algorithms by computing minimum buffer sizes for successive values of the minimum throughput required. We were not able to compare our results with SDF3 [25]. Due to the optimality of the throughput, this method should provide more accurate results. Our main problem is that it does not converge within a week, and thus no results are available.

an iterative optimization process, nor in a commercial tool. Conversely, the experimental complexity of the maximum throughput evaluation of a periodic schedule remains quite low with bounded buffers. The main reason is that the size of a feasible schedule is not impacted by this limitation. The weaknesses of this method are clearly the possible deviation from the maximal throughput and the non existence of a periodic schedule (such as for JPEG2000).

The maximum throughput evaluation was previously computed using our throughput evaluation algorithm. Throughput requirement values are then successively fixed to $0.1 \times Th^\star_G$, $0.5 \times Th^\star_G$ and $Th^\star_G$.

Figure 7. Throughput-buffering exploration for the JPEG2000 application. Horizontal axis: throughput performance (%); vertical axis: total buffer size (MB); series: Min-max, PSizing-ILP, PSizing-LP.

Table III. Buffer sizing (in KByte) for 3 throughput requirement values ($0.1 \times Th^\star_G$, $0.5 \times Th^\star_G$ and $Th^\star_G$) and the overall computation time.

Application    Requirement    Min-max [6]    PSizing-LP    PSizing-ILP
BlackScholes   0.1 × Th⋆G     16332          16332         16332
               0.5 × Th⋆G     16332          16332         16332
               Th⋆G           22572          22572         22572
               Time           77ms           155ms         120ms
Echo           0.1 × Th⋆G     28098          28098         28098
               0.5 × Th⋆G     28101          28101         28101
               Th⋆G           28115          28115         28113
               Time           40ms           79ms          60ms
H264 Encoder   0.1 × Th⋆G     no solution    1369256       1369257
               0.5 × Th⋆G     no solution    1369256       1369256
               Th⋆G           no solution    1369271       1369271
               Time           3sec           5sec          32sec
JPEG2000       0.1 × Th⋆G     3936365        3635687       3502087
               0.5 × Th⋆G     4027536        3733367       3600295
               Th⋆G           4153356        3864471       3725351
               Time           1sec           2sec          5min
Pdectect       0.1 × Th⋆G     4264311        3959031       3958311
               0.5 × Th⋆G     4327687        4123187       3958351
               Th⋆G           5375166        5191721       5068006
               Time           5sec           223sec        10h

As expected, the Min-max algorithm is always the fastest; this is because the number of equations considered in its final linear program is reduced drastically due to a pre-processing. The difference with our method is particularly notable for the pedestrian detection algorithm for which the total number of phases is high (4045 phases for 58 actors).
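The statement that PSizing-LP stays within 10 percent of PSizing-ILP can be checked against the figures of Table III. The sketch below assumes that, within each application, the flattened table lists the sizes in the order $0.1 \times Th^\star_G$, $0.5 \times Th^\star_G$, $Th^\star_G$ (with the computation time in between); this reading order is an assumption, and the dictionary layout is purely illustrative.

```python
# Buffer sizes from Table III, per application, for the requirements
# 0.1 x Th*, 0.5 x Th*, Th* (assumed reading order of the flattened table).
table_iii = {
    "BlackScholes": {"lp": (16332, 16332, 22572), "ilp": (16332, 16332, 22572)},
    "Echo": {"lp": (28098, 28101, 28115), "ilp": (28098, 28101, 28113)},
    "H264 Encoder": {"lp": (1369256, 1369256, 1369271),
                     "ilp": (1369257, 1369256, 1369271)},
    "JPEG2000": {"lp": (3635687, 3733367, 3864471),
                 "ilp": (3502087, 3600295, 3725351)},
    "Pdectect": {"lp": (3959031, 4123187, 5191721),
                 "ilp": (3958311, 3958351, 5068006)},
}
requirements = ("0.1", "0.5", "1.0")

# Relative gap of the LP-based size with respect to the ILP size.
gaps = {
    (app, req): (lp - ilp) / ilp
    for app, sizes in table_iii.items()
    for req, lp, ilp in zip(requirements, sizes["lp"], sizes["ilp"])
}
worst = max(gaps, key=gaps.get)
print(worst, f"{gaps[worst]:.2%}")
```

Under this reading, the largest gap is a little over 4 percent, reached for the pedestrian detection application at half of the maximum throughput.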

Figure 7 summarizes our experiments restricted to methods based on periodic schedules. The horizontal axis is a percentage of the maximum throughput $Th^\star_G$ while the vertical axis is the buffer size required to reach it. Eleven test points were considered between 0 and $Th^\star_G$. The best solutions are always obtained by the PSizing-ILP algorithm, which computes the minimum buffer size (but with a non-polynomial time complexity). The PSizing-LP algorithm is always better than Min-max, which confirms the previous experiments. Note that this difference does not depend on the minimum throughput required.

Our conclusion is that all mentioned methods have their own advantages. Min-max is a fast buffer sizing method. For a more accurate result within a reasonable time, PSizing-LP can be considered. These two methods can be implemented in a real-life context because of their efficiency. SDF3 and PSizing-ILP are both non-polynomial-time methods and they may not give a pertinent result within a reasonable time. However, PSizing-ILP seems to be much faster, but the result obtained using SDF3 (if any) should be more accurate.

VII. CONCLUSION

This paper has presented a characterization of feasible periodic schedules associated with a CSDFG. Two original algorithms were deduced to approximately solve the evaluation of the maximum throughput of a CSDFG and the buffer sizing with a throughput constraint. Each of them was tested using representative benchmarks.

Our method for evaluating the throughput is fast and provides a good evaluation of the throughput for our industrial applications. However, the throughput computed for instances with bounded buffers seems to be quite far from the optimum obtained without the restriction to periodic schedules, and improvements have to be sought.

Our conclusion about buffer sizing is that the restriction to periodic schedules is a good alternative to symbolic execution for real-life applications. These methods are fast enough to be implemented in a real-life system and provide good solutions. We show in this context that PSizing-LP is much slower than Min-max [6], but its solution is frequently more accurate.

A possible improvement could be obtained by extending our equations to more complex scheduling policies: Bodin et al. [9] proved that the throughput evaluation of an SDFG may be improved by considering several iterations of actors. This approach is potentially adaptable to CSDFGs. As the size of the schedule could also affect the total memory usage of the system, part of this future work will focus on integrating the schedule size as an optimization constraint. Finally, another interesting perspective is to extend our study to more expressive models such as Computation Graphs [19] or scenarios as defined for SADF [27].

REFERENCES

[1] Benny Akesson, Sander Stuijk, Anca Molnos, Martijn Koedam, Radu Stefan, Andrew Nelson, Ashkan Beyranvand Nejad, and Kees Goossens. Virtual platforms for mixed time-criticality applications: The CoMPSoC architecture and SDF3 design flow. In Quo Vadis, Virtual Platforms? (QVVP), pages 1–2, 2012.
[2] Mohamed A. Bamakhrama, Jiali Teddy Zhai, Hristo Nikolov, and Todor Stefanov. A methodology for automated design of hard-real-time embedded streaming systems. Design, Automation & Test in Europe (DATE), pages 941–946, March 2012.
[3] Abir Benabid, Claire Hanen, Olivier Marchetti, and Alix Munier-Kordon. Periodic Schedules for Bounded Timed Weighted Event Graphs. IEEE Transactions on Automatic Control, 57(5):1222–1232, 2012.
[4] Mohamed Benazouz, Olivier Marchetti, Alix Munier-Kordon, and Pascal Urard. A new approach for minimizing buffer capacities with throughput constraint for embedded system design. In AICCSA ’10, International Conference on Computer Systems and Applications, 2010.
[5] Mohamed Benazouz, Olivier Marchetti, Alix Munier-Kordon, and Pascal Urard. A new method for minimizing buffer sizes for Cyclo-Static Dataflow graphs. In Embedded Systems for Real-Time Multimedia (ESTIMedia), pages 11–20, 2010.
[6] Mohamed Benazouz and Alix Munier-Kordon. Cyclo-static DataFlow phases scheduling optimization for buffer sizes minimization. In Proceedings of the 16th International Workshop on Software and Compilers for Embedded Systems - M-SCOPES ’13, page 3, New York, New York, USA, 2013. ACM Press.
[7] Shuvra S. Bhattacharyya, Edward A. Lee, and Praveen K. Murthy. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers, Norwell, MA, USA, 1996.
[8] Greet Bilsen, Marc Engels, Rudy Lauwereins, and J.A. Peperstraete. Cyclo-static data flow. IEEE Transactions on Signal Processing, pages 3255–3258, 1995.
[9] Bruno Bodin, Alix Munier-Kordon, and Benoît Dupont de Dinechin. K-Periodic Schedules for Evaluating the Maximum Throughput of a Synchronous Dataflow Graph. In International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS XII, pages 152–159, 2012.
[10] Philippe Chrétienne. Transient and limiting behavior of timed event graphs. RAIRO Techniques et Sciences Informatiques, 4:127–192, 1985.
[11] Ali Dasdan, Sandy S. Irani, and Rajesh K. Gupta. Efficient algorithms for optimum cycle mean and optimum cost to time ratio problems. Design Automation Conference (DAC’99), pages 37–42, 1999.
[12] Robert de Groote, Jan Kuper, Hajo Broersma, and Gerard J.M. Smit. Max-Plus Algebraic Throughput Analysis of Synchronous Dataflow Graphs. 2012 38th Euromicro Conference on Software Engineering and Advanced Applications, pages 29–38, September 2012.
[13] Amir Hossein Ghamarian, Marc Geilen, Sander Stuijk, Twan Basten, Arno Moonen, Marco J.G. Bekooij, Bart D. Theelen, and MohammadReza Mousavi. Throughput Analysis of Synchronous Data Flow Graphs. In International Conference on Application of Concurrency to System Design (ACSD’06), pages 25–36, 2006.
[14] Michel Gondran and Michel Minoux. Graphs and algorithms. John Wiley and Sons, first edition, 1984.
[15] IBplusAGB5CSDF. http://tinyurl.com/IBplusAGB5CSDF, 2013.
[16] Gurobi Optimization Inc. Gurobi Optimizer Reference Manual, 2013.
[17] Gilles Kahn. The semantics of a simple language for parallel programming. Information processing, 1974.
[18] Kalray. Manycore processors for embedded computing. www.kalray.eu.
[19] Richard M. Karp and Raymond E. Miller. Properties of a model for parallel computations: Determinacy, termination, queueing. SIAM Journal on Applied Mathematics, 14(6):1390–1411, 1966.
[20] Edward A. Lee and David G. Messerschmitt. Synchronous dataflow. Proceedings of the IEEE, 75(9):1235–1245, 1987.
[21] R. Lougee-Heimer, M.J. Saltzman, and T.K. Ralphs. The COIN-OR Initiative: Open-source Software Accelerates Operations Research Progress, 2001.
[22] Louis Mandel, Florence Plateau, and Marc Pouzet. Lucy-n: a n-synchronous extension of Lustre. Mathematics of Program Construction, 2010.
[23] Raymond Reiter. Scheduling parallel computations. Journal of the ACM, 15(4):590–599, 1968.
[24] S. Sriram and Shuvra S. Bhattacharyya. Embedded multiprocessors: Scheduling and synchronization. CRC, 2009.
[25] Sander Stuijk, Marc Geilen, and Twan Basten. SDF^3: SDF For Free. In Sixth International Conference on Application of Concurrency to System Design (ACSD’06), pages 276–278. IEEE, 2006.
[26] Sander Stuijk, Marc Geilen, and Twan Basten. Throughput-Buffering Trade-Off Exploration for Cyclo-Static and Synchronous Dataflow Graphs. IEEE Transactions on Computers, 57(10):1331–1345, 2008.
[27] Bart D. Theelen, Marc Geilen, and Twan Basten. A scenario-aware data flow model for combined long-run average and worst-case performance analysis. and Models for Co-, pages 185–194, 2006.
[28] William Thies, Michal Karczmarek, and Saman Amarasinghe. StreamIt: A language for streaming applications. Compiler Construction, pages 179–196, 2002.
[29] Maarten H. Wiggers, Marco J.G. Bekooij, Pierre Jansen, and Gerard Smit. Efficient computation of buffer capacities for multi-rate real-time systems with back-pressure. Proceedings of the 4th international conference on Hardware/software codesign and system synthesis CODES+ISSS ’06, page 10, 2006.
[30] Maarten H. Wiggers, Marco J.G. Bekooij, and Gerard J.M. Smit. Efficient computation of buffer capacities for cyclo-static dataflow graphs. Proceedings of the 44th annual conference on Design automation - DAC ’07, (1):658, 2007.