Noname manuscript No. (will be inserted by the editor)

Stochastic Allocation and Scheduling for Conditional Task Graphs in Multi-Processor Systems-on-Chip

Michele Lombardi · Michela Milano · Martino Ruggiero · Luca Benini

Received: date / Accepted: date

Abstract Embedded systems designers are turning to multicore architectures to satisfy the ever-growing computational needs of applications within a reasonable power envelope. One of the most daunting challenges for MultiProcessor System-on-Chip (MPSoC) platforms is the development of tools for efficiently mapping multi-task applications onto hardware platforms. Software mapping can be formulated as an optimal allocation and scheduling problem, where the application is modeled as a task graph, the target hardware is modeled as a set of heterogeneous resources, and the objective function represents a design goal (e.g. minimum execution time, minimum usage of communication resources, etc.). Conditional task graphs, where inter-task edges represent data as well as control dependencies, are a well-known computational model to describe complex real-life applications where alternative execution paths, guarded by conditionals, can be specified. Each condition has a probability associated with each possible outcome.


Mapping conditional task graphs is significantly more challenging than mapping pure dataflow graphs (where edges only represent data dependencies). Approaches based on general-purpose complete solvers (e.g. Integer Linear Programming solvers) are limited both by computational blowup and by the fact that the objective is a stochastic functional. The main contribution of our work is an efficient and complete approach to allocation and scheduling of conditional task graphs, based on (i) an exact analytic formulation of the stochastic objective function exploiting task graph analysis and (ii) an extension of the timetable constraint for conditional activities. Moreover, our solver is integrated in a complete application development environment which produces executable code for target multi-core platforms. This integrated framework allows us to validate modeling assumptions and to assess constraint satisfaction and objective function optimization. Extensive validation results demonstrate not only that our approach can handle non-trivial instances efficiently, but also that our models are accurate and lead to optimal and highly predictable execution.

Michele Lombardi
DEIS, Università di Bologna, Viale del Risorgimento 2, 40136 Bologna, Italy
Tel.: +390512093890, E-mail: [email protected]

Michela Milano
DEIS, Università di Bologna, Viale del Risorgimento 2, 40136 Bologna, Italy
Tel.: +390512093790, E-mail: [email protected]

Martino Ruggiero
DEIS, Università di Bologna, Viale del Risorgimento 2, 40136 Bologna, Italy
Tel.: +390512093784, E-mail: [email protected]

Luca Benini
DEIS, Università di Bologna, Viale del Risorgimento 2, 40136 Bologna, Italy
Tel.: +390512093782, E-mail: [email protected]

1 Introduction

The last five years witnessed a shift in the design of integrated architectures from single-processor systems to multiprocessors. This revolution is motivated by the evidence that traditional approaches to maximizing single-processor performance have reached their limits. Power consumption, which has stopped the race to boost clock frequencies for monolithic processor cores [33] [43] [50] [32], and design complexity are the most critical factors that limit performance scaling [46]. However, Moore's law continues to enable a doubling of the number of transistors on a single die every 18-24 months. Consequently, designers are turning to multicore architectures to satisfy the ever-growing computational needs of applications within a reasonable power envelope [45]. Following Moore's law, the semiconductor industry roadmap foresees a doubling of the number of core units per die with every technology generation [20]. This trend is noticeable in mainstream [15] [54] [8] as well as in embedded computing [55] [4] [7] [57] [56].

Future Multi-Processor Systems-on-Chip (MPSoCs) hosting a large number of processors will provide scalable computational power thanks to their massive parallelism, but they require adequate programming models and software development infrastructure. This is where the next technology revolution is expected. Software running on embedded multiprocessors must deliver high performance under real-time and low-power constraints, while consumer applications are characterized by tight time-to-market constraints and extreme cost sensitivity. Hence, there is a strong need for software optimization tools that can optimally exploit the available resources [14].

Many embedded applications exhibit significant concurrency at different levels of granularity. Task-level concurrency (at the level of functions and procedures) is often evident in data-intensive embedded applications, and it is the main focus of this work. The key to successful application deployment lies in effectively mapping the concurrency in the application to the architectural resources provided by the platform. The simplest form of task-level concurrency can be captured by data-flow graphs, which are often used for pure data-streaming applications. However, many real-life applications can only be specified using non-trivial control flow, which can be captured by conditional task graphs, where alternative execution paths, guarded by conditionals, can be specified. Each condition has a probability associated with each possible outcome. The problem of allocating and scheduling conditional task graphs on processors in a distributed real-time system is obviously NP-hard, and it is considered much more challenging than data-flow graph mapping, given the non-deterministic nature of cost functions and constraints. In addition, the intricacies of component interactions in multicore architectures call for detailed system models and for their validation on a real or virtual platform [48].

In this paper we tackle the problem of specifying and mapping applications modeled as conditional task graphs onto multi-processor systems-on-chip. We target a generic architectural template for distributed memory architectures. We include in our architectural model not only processors, but also communication and storage resources, which are managed by the mapping algorithm. We propose a constraint-based allocation and scheduling algorithm leveraging the Logic-based Benders Decomposition (LBD) [21] scheme. The problem we face is the allocation of tasks to processors, of memory requirements to storage devices and of communication requirements to the platform interconnects, so as to minimize the expected value of the communication cost; we then schedule the overall application along the time line. Adopting an LBD approach enables the use of heterogeneous techniques for the two problem components, so that an ILP solver can be used to efficiently find an optimal allocation, whereas Constraint Programming can be exploited to compute a feasible schedule.

Taking conditional task graphs into account in constraint-based allocation and scheduling requires two extensions to traditional solvers. First, we need an efficient method for reasoning on task probabilities: we define two data structures, the activation set of a node and the coexistence set of two nodes, used to derive task probabilities in polynomial time. Second, we need to extend traditional constraints to take into account feasibility in all scenarios: we have developed the concept of conditional constraints and we have applied it to the timetable constraint. The propagation is polynomial if the task graph satisfies a property called Control Flow Uniqueness, which is common in conditional task graphs for system design. To our knowledge, our approach is the first complete (i.e. exact) algorithm for conditional task graph allocation and scheduling that handles instances of practical size (tens of tasks).

It is important to stress that an optimal solver is useful in practice only if it is supported by the static (design-time) and dynamic (run-time) software infrastructure required to deploy the applications on the target platform. This is a critical and non-trivial task, as we must guarantee that actual execution accurately matches in time and space the solution computed by the optimizer. Our solver is integrated within a complete application development environment, which includes a user frontend and middleware libraries and APIs for deployment on target MPSoC hardware platforms. The middleware manages OS-level orchestration issues, such as inter-task communication and synchronization, initialization, execution control, etc. We leverage the application development environment to deploy the optimal mappings to an MPSoC target, to check modelling assumptions, and to make sure that second-order effects and/or modelling approximations impact only marginally the quality of the solution computed by the optimizer.

The structure of this work is as follows. Section 2 illustrates previous work. Section 3 presents the target architecture, while the high-level target application and system models are described in Section 4. The integration of our entire framework in a software optimization methodology for MPSoCs is described in Section 5. Our application development support is illustrated in Section 6. The combined solver for the mapping problem is described in Section 7. Section 8 finally shows the computational efficiency of our optimization solver and the validation of optimal solutions on a cycle-accurate MPSoC virtual platform.

2 Related Work

The synthesis of system architectures has been extensively studied in the past. Mapping and scheduling problems of non-conditional graphs on multi-processor systems have traditionally been tackled by means of Integer Linear Programming (ILP). In general, even though ILP is used as a convenient modeling formalism, there is consensus on the fact that pure ILP formulations are suitable only for small problem instances, because of their high computational cost. An early example is represented by the SOS system, which used mixed integer linear programming (MILP) [49]. A MILP model that allows one to determine a mapping optimizing a trade-off function between execution time, processor and communication cost is reported in [1].

The complexity of pure ILP formulations for general task graphs has led to the development of heuristic approaches. Heuristic approaches provide no guarantees about the quality of the final solution, and often the need to bound search times limits their applicability to moderately small task sets. In [26] a retiming heuristic is used to implement pipelined scheduling, while simulated annealing is used in [41]. A comparative study of well-known heuristic search techniques (Genetic Algorithms, Simulated Annealing and Tabu Search) is reported in [16]. A scalability analysis of these algorithms for large real-time systems is introduced in [47]. Many heuristic scheduling algorithms are variants and extensions of list scheduling [37].

Constraint Programming (CP) is an alternative approach to Integer Programming for optimally solving combinatorial optimization problems [23]. The work in [42] is based on Constraint (Logic) Programming to represent the system synthesis problem, and leverages a set of finite domain variables and constraints imposed on these variables. Both ILP and CP techniques can claim individual successes, but practical experience indicates that neither approach dominates the other in terms of computational performance. The development of a hybrid CP-IP solver that captures the best features of both would appear to offer scope for improved overall performance; however, the issue of communication between the different modeling paradigms arises. One method is inherited from Operations Research and is known as Benders Decomposition [17]; it has been proven to converge to the optimal solution. There are a number of papers using Benders Decomposition in a CP setting [12] [2] [51] [18] [19]. [34] presents an approach that leverages a decomposition of the problem in the context of MPSoC systems: the authors tackle the mapping sub-problem with ILP and the scheduling one with CP. That work considers only pipelined streaming applications and does not handle conditional task graphs. Solving the problem of allocating and scheduling a general conditional task graph onto an MPSoC requires more complex problem models and cost functions, as well as more complex subproblem relaxations and Benders cuts; this is tackled, for example, in [30] and [29].

Allocation and scheduling of Conditional Task Graphs is a relevant problem in system design, and it is usually solved heuristically: for instance, in [10] a genetic algorithm is devised on the basis of a conditional scheduling table whose (exponential number of) columns represent the combinations of conditions in the CTG and whose rows are the starting times of the activities that appear in each scenario. The number of columns is indeed reasonable in a number of real applications. Conditional scheduling tables are also used in some list-scheduling based approaches like [9]. The same work, together with [52] and [53], also considers a fixed start time schedule instead of using a scheduling table, thus avoiding exponential space complexity.

Finally, to our knowledge [24] is the only complete approach for allocation and scheduling of Conditional Task Graphs; it uses Constraint Programming for modeling the allocation and scheduling problem. However, the solving algorithm used is complete only for small task graphs (up to 10 activities). Another complete CP-based approach is described in [25] and targets low-level instruction scheduling with Hierarchical Conditional Dependency Graphs (HCDG); conditional dependencies are modeled in HCDGs by introducing special nodes (guards) to condition the execution of each operation; complexity blowup is avoided by providing a single schedule where operations with mutually exclusive guards are allowed to execute at the same time step even if they access the same resource. We basically adopted the same approach to avoid scheduling each scenario independently. Mutual exclusion relations are listed in HCDGs for each pair of tasks and are computed offline by checking the compatibility of guard expressions, whereas in CTGs they are deduced from the graph structure; note that the in-search computation described in the paper is only used to support speculative execution. Pairwise listing of exclusion relations is a more general approach, but it lacks some nice properties which are necessary to efficiently handle non-unary capacity resources; in particular, computing the worst case usage of such a resource is an NP-complete problem if only pairwise mutual exclusions are known (see Section 7.3.2); in fact, [25] takes only unary resources into account.

To our knowledge, no complete approach exists that can handle conditional task graphs of practical interest. Moreover, we are unaware of any published works that report validation results on the actual execution of applications on MPSoC targets. Our work addresses both these open issues.

3 Target Architecture: platform resource description

Our mapping strategy targets a general template for a message-oriented distributed memory architecture. An embodiment of this template architecture is considered in this work, in order to provide input data to the optimization framework and to validate its solutions via functional simulation. The specific platform instance, conforming to the template, only determines the annotated values in the application task graph (costs for communication and execution times), which is an input to our framework. Alternative architectures matching the same template can be input to our methodology, with just the burden of re-characterizing the costs for the basic communication and synchronization mechanisms. Therefore, the allocation and scheduling methodology we propose is not affected by specific design choices (e.g., the kind of processing unit or the bus architecture). The characteristics of the architectural template targeted by our optimization framework include:

1. multiple computation tiles, also referred to as processing elements (PE); each tile consists of a processing core, instruction and data Level 1 caches and a local memory device;
2. support for message exchange between the computation tiles;
3. availability of remote storage devices for those program data that cannot be stored in local memories; these devices are non-local to the tiles and accessible through the system bus.

In this work the computation tiles are assumed to be homogeneous, and the processing elements (PE) are modeled as unary resources in the optimizer; the number of tiles (and consequently of processing elements) is referred to as $n_p$. No preemption support is assumed: this is quite reasonable, as in many currently available commercial platforms support for preemption is indeed scarce. The whole framework can easily be extended to take into account heterogeneous processing elements; in this case, task execution time is allocation dependent and must be characterized for each different type of core. Furthermore, we can also consider the case in which some tasks cannot be allocated to some cores.

In embedded systems and applications, where power consumption and cost play more important roles than versatility, local memories can be used besides/instead of caches. A local memory is a small memory strictly connected to a processing core: it is quite similar to a cache memory in terms of size and speed (ideally one-cycle access time), but it has no dedicated logic for dynamic swapping of contents. While a cache uses a hardware controller to decide which data to keep in cache memory and which data to prefetch, the local memory approach does not require any hardware support in addition to the memory itself, but requires software to control all data transfers to and from scratchpad memories. For this reason, local memories are sometimes also called "software controlled caches". In general, it is the designer's responsibility to explicitly map addresses of external memory to locations of the local memory; the optimal allocation of task program data to the local versus the remote memory is a specific goal of our optimization framework. For this purpose, each local storage device is modeled as a reservoir and assumed to have finite capacity $Cap_j$. It is possible to implement both a local memory and a cache at the same time, exploiting their respective advantages; caches require no optimizer intervention, as they are hardware managed.

The remote storage can be provided by a unified memory with partitions associated with each processor, or by a separate private memory for each processor core connected to the system bus. This assumption concerning the memory hierarchy reflects the typical trade-off between low access cost, low capacity local memory devices and high cost, high capacity memory devices at a higher level of the hierarchy. The remote storage device is in general high capacity, and is assumed to be infinite for the problem at hand.

By adopting an additive approach (see [28]), the bus is modeled in the optimizer as a discrete resource with capacity equal to the provided data transfer rate (bandwidth). The additive model applies as well to many non-bus topologies, such as crossbar configurations or networks-on-chip; however, in such cases the provided bandwidth is so large that the communication resources are unlikely to become a serious bottleneck.

3.1 Clock-cycle accurate simulator

We developed a clock-cycle accurate simulator for this architectural template (see Fig. 1) in order to prove the viability of our approach.

Fig. 1 Message-oriented distributed memory architecture.

The computation tiles are assumed to be homogeneous and consist of ARM cores (including instruction and data caches) and of tightly coupled software-controlled scratchpad memories for fast access to program operands and for storing input data. We used an AMBA AHB bus as system interconnect. A DMA engine is attached to each core, as presented in [39], allowing efficient data transfers between the local scratchpad and non-local memories reachable through the bus. The DMA control logic supports multichannel programming, while the DMA transfer engine has a dedicated connection to the scratchpad memory allowing fast data transfers from or to it.

In order to communicate with each other, cores use non-cacheable shared memory. For synchronization among the processors, semaphore and interrupt facilities are used:

1. each core can send interrupt signals to the others using the hardware interrupt module mapped on the global addressing space;
2. several cores can synchronize using the semaphore module, which implements test-and-set operations.

Finally, each processor core has a private on-chip memory, which can be accessed only by gaining bus ownership. In principle, it could also be an off-chip memory. In any case, it has a higher access cost and is used to store program operands that do not fit in the scratchpad memory. The software support is provided by a real-time operating system called RTEMS [44]. Our implementation thus supports:

– either processor- or DMA-initiated memory-to-memory transfers,
– either polling-based or interrupt-based synchronization, and
– flexible allocation of the consumer's message buffer to the local scratchpad or the non-local private memory.

The architecture is assumed to provide hardware-software support for messaging, targeting scalability to a large number of communicating cores. Messages can be exchanged by tasks through software communication queues, which can be physically allocated either in scratchpad memory or in shared memory, depending on whether the tasks are mapped onto the same processor or not. This avoids generating bus traffic and incurring congestion delays for local communications. We also target architectures where synchronization between producer-consumer pairs does not give rise to semaphore polling traffic on the bus, since this might unacceptably and unpredictably degrade the performance of ongoing message exchanges. Interrupt-based synchronization or the implementation of distributed semaphores at each computation tile are two example mechanisms matching our requirements.

Several platforms matching our template are available as commercial products on the market. Clearly, our architectural template is a simplified view of these highly complex commercial platforms, but it captures their main communication and synchronization support features. ARM MPCore [3] can be classified into this category. This processor incorporates up to four ARM11 CPUs. The communication between different cores is established through shared memory, while synchronization can be obtained both via interrupts and via semaphores [22]. Another commercial microprocessor architecture which fits our target template is CELL [8]. CELL is a heterogeneous multi-core architecture where each processing element contains a local memory area. The architecture of the CELL processor closely matches the template we are targeting, because of its synergistic processing elements with local storage and its support for message-based stream processing. Another solution which is similar to our hardware model is the Multi-processor DSP Family from Cradle Technologies [7].

4 Target Application

The multi-task application to be mapped and executed on top of the target hardware platform is represented as a conditional task graph with precedence constraints. In the following we introduce some preliminaries on Conditional Task Graphs and on the high-level application model.

4.1 Conditional Task Graph

Definition 1 A CTG is a directed acyclic graph consisting of a tuple $\langle T, A, C, P \rangle$, where

– $T = T_B \cup T_F$ is a set of nodes; $t_i \in T_B$ is called a branch node, while $t_i \in T_F$ is a fork node. $T_B$ and $T_F$ partition the set $T$, i.e., $T_B \cap T_F = \emptyset$. Also, if $T_B = \emptyset$ the graph is a deterministic task graph. Each node with one or fewer outgoing arcs is considered a fork node.
– $A$ is a set of arcs, given as ordered pairs $a_h = (t_i, t_j)$.
– $C$ is a set of pairs $\langle t_i, c_i \rangle$, one for each branch node $t_i \in T_B$; $c_i$ is the condition labeling the node.
– $P$ is a set of triples $\langle a_h, Out, Prob \rangle$, each one labeling an arc $a_h = (t_i, t_j)$ rooted in a branch node $t_i$. $Out = Out_{ij}$ is a possible outcome of the condition $c_i$ labeling node $t_i$, and $Prob = p_{ij}$ is the probability that $Out_{ij}$ is true ($p_{ij} \in [0, 1]$).

The CTG always contains a single root node (with no incoming arcs) that is connected (either directly or indirectly) to every other node in the graph. For each branch node $t_i \in T_B$ with condition $c_i$, every outgoing arc $(t_i, t_j)$ is labeled with one distinct outcome $Out_{ij}$, such that $\sum_{(t_i, t_j)} p_{ij} = 1$.

Intuitively, at runtime only a subgraph of the CTG will execute, depending on the branch node condition outcomes. The root node always executes. Each time a branch node is executed, its condition is evaluated and only one of its condition outcomes is true.

Fig. 2 A Conditional Task Graph

Consequently, the corresponding outgoing arc is evaluated to true. In Figure 2A, if condition $a$ is true at runtime, then arc $(t_1, t_2)$ has status true and node $t_2$ executes, while arc $(t_1, t_5)$ has status false and node $t_5$ does not execute. With an abuse of notation we have omitted the condition in the node and we have labeled arc $(t_1, t_2)$ with $a$, meaning $a = true$, and $(t_1, t_5)$ with $\neg a$, meaning $a = false$. Each time a fork node is executed, all its outgoing arcs are evaluated to true. Intuitively, fork nodes originate parallel activities, while branch nodes have mutually exclusive outgoing arcs. A set of arcs is mutually exclusive if only one of them can be evaluated to true at execution time.

We also need to define and-nodes and or-nodes. A node with more than one ingoing arc is an or-node if its ingoing arcs are mutually exclusive; it is instead an and-node if it is possible that all incoming arcs are evaluated to true at execution time. Mixed nodes are not allowed, but can easily be modeled by properly combining and-nodes and or-nodes. For instance, in Figure 2A, $t_0$ is the root node and it is a fork node. Arcs $(t_0, t_1)$ and $(t_0, t_{12})$, rooted in a fork node, are always evaluated to true. Node $t_1$ is a branch node, labeled with condition $a$. The probability of $a = true$ is 0.5 and the probability of $a = false$ is also 0.5, as depicted in Figure 2B. Node $t_{20}$ is an or-node, while node $t_{21}$ is an and-node.

At execution time, we are interested in the concept of scenario. A scenario corresponds to an assignment of outcomes to conditions and defines a deterministic task graph containing the set of nodes and arcs that execute in the scenario. Given a CTG $\langle T, A, C, P \rangle$ and a scenario $s$ in the set of all possible scenarios $S$, the deterministic task graph $TG(s)$ associated with $s$ is defined as follows:

– The root node always belongs to $TG(s)$.
– An arc $a_k = (t_i, t_j)$ belongs to $TG(s)$ if it originates in a fork node $t_i$ that belongs to $TG(s)$, or if it originates in a branch node $t_i$ that belongs to $TG(s)$ and has an associated outcome $c_k \in s$.
– A node $t_i$ belongs to $TG(s)$ if it is an and-node and all its ingoing arcs $a_k$ are in $TG(s)$, or if it is an or-node and only one ingoing arc $a_k$ is in $TG(s)$.

The deterministic task graph derived from the CTG in Figure 2 and associated with the runtime scenario $a = true$, $d = true$ and $e = false$ is depicted in Figure 3. We now need to associate a probability with each scenario:

$$\forall s \in S \qquad p(s) = \prod_{c_r \in s} p_{c_r}$$
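As an illustration, the scenario probability can be computed directly from the outcome probabilities. The following minimal Python sketch is our own illustration (not part of the paper's solver): the probabilities for $a$ mirror Figure 2B, while those for $d$ and $e$ are invented for the example.

```python
from math import prod

# Outcome probabilities for a hypothetical CTG: "a"/"~a" mirror Figure 2B
# (p(a = true) = 0.5); the values for d and e are made up for illustration.
outcome_prob = {"a": 0.5, "~a": 0.5, "d": 0.3, "~d": 0.7, "e": 0.6, "~e": 0.4}

def scenario_probability(scenario):
    """p(s): product of the probabilities of the outcomes assigned in s."""
    return prod(outcome_prob[out] for out in scenario)

# The scenario a = true, d = true, e = false of Figure 3:
print(scenario_probability({"a", "d", "~e"}))  # 0.5 * 0.3 * 0.4 = 0.06
```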


Fig. 3 The deterministic task graph associated with the runtime scenario a = true, d = true and e = false

In this paper we are interested in computing the probability of sets of scenarios. In particular, we are interested in all scenarios where a given task executes, and in all scenarios where a pair of tasks executes. The probability of the scenarios where task $t_i$ executes can be computed as follows:

$$p(t_i) = \sum_{s \,|\, t_i \in TG(s)} p(s)$$

while the probability that a pair of tasks executes in the same scenario is

$$p(t_i \wedge t_j) = \sum_{s \,|\, t_i, t_j \in TG(s)} p(s)$$

Finally, for modeling purposes, we also define for each task an activation function $f_{t_i}$; this is a stochastic function $f_{t_i}: S \to \{0, 1\}$ such that

$$f_{t_i}(s) = \begin{cases} 1 & \text{if } t_i \in TG(s) \\ 0 & \text{otherwise} \end{cases}$$
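These definitions translate directly into a brute-force computation that enumerates all scenarios; the polynomial-time algorithms of Section 7.2 exist precisely to avoid this blowup. A minimal sketch of the definitional computation, assuming scenarios are given as (probability, node set) pairs:

```python
def p_task(ti, scenarios):
    """p(t_i): total probability of the scenarios whose TG(s) contains t_i.

    scenarios: iterable of (p_s, nodes_s) pairs, nodes_s being the node set
    of the deterministic task graph TG(s)."""
    return sum(p_s for p_s, nodes_s in scenarios if ti in nodes_s)

def p_coexist(ti, tj, scenarios):
    """p(t_i ^ t_j): probability that t_i and t_j appear in the same TG(s)."""
    return sum(p_s for p_s, nodes_s in scenarios
               if ti in nodes_s and tj in nodes_s)
```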

4.2 Application and Task Modeling

Given a CTG representing the high-level application, we interpret each node as an execution task and each arc as a communication task between the two corresponding execution tasks.

Fig. 4 Task execution phases

In particular, we model execution and communication tasks as non-preemptive activities with a five-phase behavior (see Figure 4): they read all the communication queues corresponding to the incoming arcs of a node (as a set of READ QUEUE activities), possibly read their internal state (as a single READ STATE activity), execute, write their state (as a single WRITE STATE activity) and finally write all the communication queues corresponding to the outgoing arcs of the node (as a set of WRITE QUEUE activities).


Each phase requires a portion of a storage device for a specific memory requirement; each such memory requirement can be allocated either locally, in the local memory, or remotely, in the on-chip memory. For modeling purposes, each activity has a fixed duration, statically depending on the memory allocation choices and referred to as $DUR(act)$ throughout the paper.

EXECUTE activity: each execution task $t_i$ is labeled with its worst case execution time (WCET), a release date $rl(t_i)$ and a deadline $dl(t_i)$, which play a critical role whenever application real-time constraints are to be met. Storage locations are required for computation data and for processor instructions (globally, "program data"); we refer to this quantity as $m_i$. In the proposed method the WCET is computed via repeated execution on the cycle-accurate simulator and depends on the memory allocation: throughout the paper we refer to the execution time with remote program data allocation as $D_i^e$ and to the execution time with local allocation as $d_i^e$.

READ/WRITE STATE activities: each task may have some internal state, which is needed when a task can be activated more than once. In our framework the internal state of a task is represented by those data structures which univocally characterize the actual state of the task. We need to provide storage space for the internal state in order to reliably schedule these tasks in case of multiple activations; we refer to this quantity as $st_i$. A local allocation of internal state and program data is allowed on the processor where the corresponding execution task runs. Note that if the program data and internal state are locally allocated, the duration of this activity is zero, while if they are remotely allocated, a duration is associated with the data transfer activity (namely $D_i^{rs}$, $D_i^{ws}$): in fact, before executing the task, its data must be transferred from the remote to the local memory device, and this operation takes time. Data transfers associated with execution and state reading/writing activities with remote memory allocation require a portion of the bus bandwidth (referred to as $req(act)$) and generate bus traffic ($mtraf_i$ to access program data, $straf_i$ for state reading/writing). Both the bandwidth requirement and the generated traffic depend on the size of the data to exchange and are computed for each task phase separately.

READ/WRITE QUEUE activities: each arc between two tasks $a_r = (t_i, t_j)$ represents a communication and is labeled with the memory requirement for the communication queue. The communication requirement, i.e., the amount of data that needs to be exchanged between the two tasks, is referred to as $c_r$. The communication queue related to arc $a_r = (t_i, t_j)$ can be allocated locally only if both $t_i$ and $t_j$ run on the same processor; in this case, the duration of the read/write activity is zero. In the remote case, instead, data are first copied locally and then processed by the tasks, and this takes time (referred to as $D_r^{rd}$ and $D_r^{wr}$, respectively, for queue reading and writing; the superscript $rd$ stands for "read", while $wr$ stands for "write"). These data transfer activities require a portion of the bus bandwidth (referred to as $req(act)$). We also compute the generated bus traffic $ctraf_r$ for writing/reading the communication queues.
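For reference, the per-task and per-arc annotations just introduced can be collected in simple records; the following sketch is our own summary (field names are ours, mapped to the paper's symbols in the comments):

```python
from dataclasses import dataclass

@dataclass
class TaskParams:
    """Per-task annotations used by the model (field names are ours)."""
    wcet_remote: int      # D_i^e: execution time with remote program data
    wcet_local: int       # d_i^e: execution time with local program data
    mem: int              # m_i: program data size
    state: int            # st_i: internal state size
    dur_read_state: int   # D_i^rs: READ STATE duration, remote allocation
    dur_write_state: int  # D_i^ws: WRITE STATE duration, remote allocation
    mem_traffic: int      # mtraf_i: bus traffic for remote program data
    state_traffic: int    # straf_i: bus traffic for remote state accesses
    release: int = 0      # rl(t_i)
    deadline: int = 0     # dl(t_i)

@dataclass
class ArcParams:
    """Per-arc (communication) annotations."""
    comm: int             # c_r: communication queue size
    dur_read: int         # D_r^rd: READ QUEUE duration, remote queue
    dur_write: int        # D_r^wr: WRITE QUEUE duration, remote queue
    comm_traffic: int     # ctraf_r: bus traffic for remote queue accesses
```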
The problem at hand consists in finding an allocation of tasks to processors, of memory requirements to storage devices and of data transfer activities to the communication channel, together with a schedule, such that temporal and resource constraints and performance requirements are guaranteed (i.e., task deadline constraints are met by the final schedule) and the expected bus traffic (the weighted sum of the bus traffic in each scenario) is minimized. In particular, a solution to the problem at hand consists in:

a. for each task $t_i$ in the set $T$: (1) a specification of the PE the task is mapped to, (2) an indication of whether the program data requirement has to be mapped on the local or on the remote device, and (3) a similar indication for the state memory;
b. for each arc $(t_i, t_j)$ in the set $A$: an indication of whether the queue buffer has to be mapped on the local or on the remote device;
c. for each activity corresponding to an execution or communication phase of each task: a fixed start time.

All such values must be consistent with the precedence constraints imposed by the CTG and with the resource capacity constraints. In addition, all such values must be valid whatever the execution scenario is; for example, this allows two activities sharing the same unary resource to be assigned the same start time if they never appear in the same execution scenario.

5 Development Methodology

In this section we explain how an optimizer can be used in the context of a real system-level design flow. Figure 5 shows a pictorial overview of the proposed application development methodology flow. It is composed of different phases, each leveraging distinct facilities.

Fig. 5 Application development methodology.

The main phases are:

1. the application implementation, via the Customizable Application Template;
2. the application characterization, through the Cycle Accurate Simulator;
3. the optimization of hardware resource utilization, exploiting our Optimizer;
4. the final optimal system software implementation, leveraging the Resource Management Support;
5. the platform execution, which relies on our Run-time support.

The starting point is a Conditional Task Graph describing the target application. This high-level description is translated into a real software implementation using the Customizable Application Template.

The second step consists of using a virtual cycle-accurate platform to pre-characterize the input task set and compute the worst-case execution times. The task graph is annotated with computation time and with the amount of communication and storage requirements. The task execution time is given for two cases: program data stored entirely in local memory, and program data stored in remote memory only. In the latter case, the impact of cache misses on execution time is taken into account. The same is done for the communication durations, i.e., measuring them with queues stored in local memory and remotely. The task graph is also annotated with the amount of bus bandwidth required by each activity: it depends on the size of the data to exchange and on the activity duration, and it is computed for each task phase separately. However, in a given execution not all tasks will run on the target platform: in fact, the application contains conditional branches (like if-then-else control structures). Therefore, an accurate application profiling step is needed, which yields a probability distribution for each conditional branch that intuitively gives the probability of choosing that branch during real future executions.

We model task communication and computation separately to better account for their bus utilization requirements, although from a practical viewpoint they are part of the same atomic task. The initial communication phase consumes a bus bandwidth which is determined by the hardware support for data transfer and by the bus protocol efficiency (latency for a read transaction). The computation part of the task instead consumes an average bandwidth defined by the ratio of program data size (in case of remote mapping) to execution time. A less accurate characterization framework could be used to model the task set, though potentially incurring more uncertainty with respect to the optimizer solutions.

The input task parameters are then fed to the optimization framework, which provides an optimal allocation of tasks and memory locations to processors and storage devices, respectively, and a feasible schedule for the tasks meeting the real-time requirements of the application. After the optimization phase, we can build the optimal implementation of our target software system using both the optimizer solution for the hardware platform (i.e., the optimal allocation and schedule) and the application development support (i.e., the Customizable Application Template and the OS-level and task-level run-time support). Finally, the application code can be uploaded on the target platform and executed.

In this paper, we focus specifically on the optimization phase and its implementation. We describe the optimal stochastic allocation and scheduling algorithm in Section 7. To make the paper self-contained, we also introduce the other components, along with proper references, in the next section.


6 Efficient Application Development Support

In this section we describe our application development support. It is mainly composed of a generic Customizable Application Template and a set of high-level APIs. Our facilities tackle both OS-level issues, such as task allocation and scheduling, and task-level issues, like inter-task communication and synchronization. The main goal of our development framework is the exact and reliable execution of the application after the optimization step, giving at the same time guarantees about high performance and constraint satisfaction. For more technical details, please refer to our previous work in this area [11].

We set up a generic customizable application template allowing software developers to easily and quickly build their parallel applications starting from a high-level task and data flow graph specification compliant with our previously described models. Programmers can at first think about their applications in terms of task dependencies and quickly draw the task graphs, and then use our tools and libraries to translate the abstract representation into C code. This way, they can devote most of their effort to the functionality of tasks rather than the implementation of their communication, synchronization and scheduling mechanisms. Users can configure the Customizable Application Template via an XML file, which is automatically translated into C code. We also implemented an Eclipse plug-in graphical interface in order to make the configuration of the Customizable Application Template easier and less error-prone. For every task indicated within the application template, C code is automatically generated which reflects the considered task computational model (i.e., Reading Input Phase, Reading State Phase, Execution, etc.). Following our scalable and parameterizable template, we also ensure that the final implementation of the target application will be compliant with the modeling assumptions of the optimizer, and that the optimal performance and the constraint satisfaction of the computed mapping solutions will be achieved in practice.

We implemented a set of APIs by which users can reproduce optimizer solutions on their target platform with great accuracy. Once the target application has been implemented using our generic customizable template, tasks, program data and communication queues are allocated to the proper hardware resources (processor or memory cores) as indicated by the computed allocation solution. In order to reproduce the exact scheduling behavior of the optimizer, we implemented a scheduling support middleware on the target platform. Software support for efficient messaging is also provided by our set of high-level APIs. The communication and synchronization library abstracts low-level architectural details from the programmer, such as memory maps or the explicit management of hardware semaphores and interrupt signaling. Messages can be moved directly between scratchpad memories.

7 Optimization phase

The focus of this paper is the optimization phase. We describe here the problem model and the corresponding algorithms for its solution.


7.1 Problem model

The problem we face is the following: given a CTG, we have to map each node/task of the CTG onto a processing element and each memory requirement (program data, internal state and communication queues) onto a local/remote storage device, and to schedule tasks and communications on the available resources. Since the CTG encapsulates different run-time scenarios, we have to guarantee that for each possible scenario all temporal and resource constraints are satisfied.

The objective function we have to minimize is the bus usage, called communication cost. When several scenarios have to be taken into account, we should not consider the deterministic value of the objective function, but its expected value, namely a sum of the objective function values in the individual scenarios, weighted by the scenario probabilities. We therefore minimize the expected bus utilization. The optimal solution to our problem is in fact a unique assignment of starting times and resources to tasks that is feasible whatever the runtime scenario is, and that minimizes the expected value of the bus utilization.

Note that the bus utilization to be minimized comprises two contributions. The first is related to single tasks: once computation data and/or internal state are physically allocated to remote memory, a number of bus accesses must be performed; this communication depends on the amount of data to be stored. The second contribution is related to pairs of communicating tasks in the task graph: if two communicating tasks are allocated onto two different processors, they must access the bus; this contribution depends on the amount of data the two tasks exchange.

Problem structure. A number of papers in the recent literature [18], [19], [28], [51] suggest that scheduling problems with alternative resources are efficiently solved via so-called Logic-based Benders Decomposition, which works as follows: the allocation problem (called the master problem) is solved first, and the scheduling problem (called the subproblem) later. The master is solved to optimality and its solution is passed to the subproblem solver. If the solution of the master is feasible for the subproblem constraints, then the overall problem is solved to optimality. If, instead, the master solution cannot be completed by the subproblem solver, a no-good is generated and added to the model of the master problem, roughly stating that the solution passed should not be computed again (it becomes infeasible), and a new optimal solution is found for the master problem respecting the (set of) no-good(s) generated so far. Given the structure of the allocation and scheduling problems, we have solved the allocation problem via Integer Linear Programming and the scheduling problem via Constraint Programming.

In Constraint Programming (CP), problems are modeled declaratively by defining a set of variables representing problem entities; each variable has an associated domain representing its possible assignments, and a set of constraints limits the values that the variables can simultaneously assume. A solution of a constraint program is an assignment of values to variables which is consistent with the constraints. The solving process of a constraint solver interleaves constraint propagation and search. Each constraint is propagated so as to remove a priori those values that cannot appear in any consistent solution. Then, since propagation is not complete, i.e., some values left in the domains can still be inconsistent, search is performed.
The process of constraint propagation and search is iterated until either a solution is found or a failure occurs. One of the most successful applications of CP to date is scheduling. Problem variables are activity starting times, and many resource and temporal constraints have been devised for scheduling applications; for a survey on existing constraints and filtering algorithms the interested reader can refer to [35].

Integer Programming is an older method, with roots that date back to the late 1950s. Integer Programming can be thought of as a restriction of Constraint Programming. In fact, Integer Programming has only two types of variables: integer variables, whose domains contain non-negative integers, and continuous variables, whose domains contain non-negative real values. In addition, IP allows only one type of constraint: linear inequalities. Finally, the objective function must be linear in the variables. The solving principle of IP is based on the solution of the linear relaxation, allowing arbitrary sets of linear constraints to be treated as a global constraint and providing a global view of the problem. The relaxation provides a bound enabling efficient pruning of the search tree and directing search toward promising regions.
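The interaction between the two solvers in the LBD scheme described above can be summarized by a simple loop. The following sketch is schematic: solve_allocation, solve_schedule and make_nogood are placeholders for the ILP master, the CP subproblem and the cut generation, not actual library calls.

```python
def logic_based_benders(instance):
    """Schematic LBD loop: ILP allocation master + CP scheduling subproblem.

    solve_allocation, solve_schedule and make_nogood are placeholders for
    the solver calls described in the text, not real library functions."""
    nogoods = []
    while True:
        # Master: optimal allocation subject to all no-goods found so far.
        allocation = solve_allocation(instance, nogoods)
        if allocation is None:
            return None                      # master infeasible: no solution
        # Subproblem: look for a feasible schedule under this allocation.
        schedule = solve_schedule(instance, allocation)
        if schedule is not None:
            return allocation, schedule      # feasible => globally optimal
        # Infeasible schedule: forbid this allocation and iterate.
        nogoods.append(make_nogood(allocation))
```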

7.2 Allocation problem model

With regard to the platform described in Section 3, the allocation problem can be stated as that of assigning tasks to processing elements and memory requirements to storage devices. First we state the stochastic allocation model; then we show how this model can be transformed into a deterministic one through the use of existence and co-existence probabilities of tasks. To compute these probabilities, we propose two polynomial-time algorithms exploiting the CTG structure.

7.2.1 Stochastic integer linear model

Suppose $n_t$ is the number of tasks, $n_p$ the number of processing elements (PE), and $n_a$ the number of arcs. We introduce for each task and each PE a variable $T_{ij}$ such that $T_{ij} = 1$ iff task $i$ is assigned to processing element $j$. We also define variables $M_{ij}$ such that $M_{ij} = 1$ iff task $i$ allocates its program data locally on PE $j$, and $M_{ij} = 0$ otherwise. Similarly, we introduce variables $St_{ij}$ for the internal state requirement of task $i$ and $C_{rj}$ for the communication queue of arc $r$; both $St_{ij}$ and $C_{rj}$ are 1 if the corresponding memory requirement is mapped to the local device of PE $j$, and 0 otherwise. The objective function also depends on the runtime scenario $s$; we call $S$ the set of all possible runtime scenarios. We want to minimize the expected value of the bus traffic. The allocation model is defined as follows:

$$\min\; z = E(\mathrm{trf}(M, St, C, S))$$

subject to:

$$\sum_{j=0}^{n_p-1} T_{ij} = 1 \qquad \forall i = 0, \ldots, n_t - 1 \qquad (1)$$

$$St_{ij} \leq T_{ij} \qquad \forall i = 0, \ldots, n_t - 1,\; j = 0, \ldots, n_p - 1 \qquad (2)$$

$$M_{ij} \leq T_{ij} \qquad \forall i = 0, \ldots, n_t - 1,\; j = 0, \ldots, n_p - 1 \qquad (3)$$

$$C_{rj} \leq T_{ij} \qquad \forall arc_r = (t_i, t_k),\; r = 0, \ldots, n_a - 1,\; j = 0, \ldots, n_p - 1 \qquad (4)$$

$$C_{rj} \leq T_{kj} \qquad \forall arc_r = (t_i, t_k),\; r = 0, \ldots, n_a - 1,\; j = 0, \ldots, n_p - 1 \qquad (5)$$

$$\sum_{i=0}^{n_t-1} \left( st_i\, St_{ij} + m_i\, M_{ij} \right) + \sum_{r=0}^{n_a-1} c_r\, C_{rj} \leq Cap_j \qquad \forall j = 0, \ldots, n_p - 1 \qquad (6)$$

Constraints (1) force each task to be assigned to a single processor. Constraints (2) and (3) state that program data and internal state can be locally allocated on PE $j$ only if the corresponding task $i$ runs on it. Constraints (4) and (5) enforce that the communication queue of arc $r$ can be locally allocated only if both the source and the destination tasks run on the same processing element $j$. Finally, constraints (6) ensure that the sum of the locally allocated internal state ($st_i$), program data ($m_i$) and communication ($c_r$) memory cannot exceed the scratchpad device capacity ($Cap_j$). All tasks have to be considered here, regardless of whether they execute at runtime, since scratchpad memory is statically allocated.

Quite standard symmetry breaking constraints have been added to the model; namely, lexicographic ordering has been forced on the sets of tasks allocated to different PEs. In particular we require:

$$\min_{t_i \in T} \{ i \mid T_{i,j} = 1 \} < \min_{t_i \in T} \{ i \mid T_{i,j+1} = 1 \}$$

This can be enforced by posting the following set of linear constraints:

$$\sum_{k=j+1}^{n_p-1} T_{ik} \leq \sum_{h=0}^{i-1} \sum_{k=0}^{j} T_{hk} \qquad \forall i = 0, \ldots, n_t - 1,\; \forall j = 0, \ldots, n_p - 2$$

where the left member tells whether a task with index $i$ is on a processor with index higher than $j$; in that case the right member must be at least 1, i.e., at least one task with index lower than $i$ must be allocated to a processor with index not greater than $j$.
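For concreteness, constraints (1)-(6) can be stated in a few lines with an off-the-shelf ILP modeler. The sketch below uses the open-source PuLP package (our own choice for illustration; the paper does not prescribe a specific ILP tool). Symmetry breaking is omitted and the objective is left to be set once the deterministic form of Section 7.2.2 is available.

```python
import pulp

def build_allocation_model(nt, np_, arcs, st, m, c, cap):
    """Constraints (1)-(6) of the allocation model (symmetry breaking and
    the objective are omitted); arcs[r] = (i, k) gives the endpoint tasks
    of arc r."""
    na = len(arcs)
    prob = pulp.LpProblem("allocation", pulp.LpMinimize)
    T  = pulp.LpVariable.dicts("T",  (range(nt), range(np_)), cat="Binary")
    M  = pulp.LpVariable.dicts("M",  (range(nt), range(np_)), cat="Binary")
    St = pulp.LpVariable.dicts("St", (range(nt), range(np_)), cat="Binary")
    C  = pulp.LpVariable.dicts("C",  (range(na), range(np_)), cat="Binary")
    for i in range(nt):
        prob += pulp.lpSum(T[i][j] for j in range(np_)) == 1          # (1)
        for j in range(np_):
            prob += St[i][j] <= T[i][j]                               # (2)
            prob += M[i][j] <= T[i][j]                                # (3)
    for r, (i, k) in enumerate(arcs):
        for j in range(np_):
            prob += C[r][j] <= T[i][j]                                # (4)
            prob += C[r][j] <= T[k][j]                                # (5)
    for j in range(np_):
        prob += (pulp.lpSum(st[i] * St[i][j] + m[i] * M[i][j]
                            for i in range(nt))
                 + pulp.lpSum(c[r] * C[r][j] for r in range(na))
                 <= cap[j])                                           # (6)
    return prob, T, M, St, C
```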

All problem constraints must be satisfied independently of the runtime scenario. On the contrary, scenarios have to be considered in the objective function expected value. The expected value of the bus traffic is computed taking into account the set $S$ of all possible runtime scenarios and their probabilities $p(s)$ as follows:

$$E(\mathrm{trf}(M, St, C, S)) = \sum_{s \in S} p(s)\, \mathrm{trf}(M, St, C)(s)$$

where:

$$\mathrm{trf}(s) = \sum_{i=0}^{n_t-1} \mathrm{task\_trf}_i(M^i, St^i, s) + \sum_{a_r = (t_i, t_k)} \mathrm{comm\_trf}_r(C^r, s)$$

$$\mathrm{task\_trf}_i(M^i, St^i, s) = f_{t_i}(s) \left[ mtraf_i \left( 1 - \sum_{j=0}^{n_p-1} M_{ij} \right) + straf_i \left( 1 - \sum_{j=0}^{n_p-1} St_{ij} \right) \right]$$

$$\mathrm{comm\_trf}_{a_r = (t_i, t_k)}(C^r, s) = f_{t_i}(s)\, f_{t_k}(s)\, ctraf_r \left( 1 - \sum_{j=0}^{n_p-1} C_{rj} \right)$$

where $M^i$, $St^i$ denote the sets of $M_{ij}$ and $St_{ij}$ variables related to task $t_i$, and $C^r$ is the set of $C_{rj}$ variables related to arc $a_r$; moreover, $mtraf_i$, $straf_i$, $ctraf_r$ are the amounts of bus traffic coming, respectively, from accesses to the program data and internal state of task $t_i$, and to the communication buffer related to arc $a_r$. Note that by using weights one can take into account multiple accesses to the same data: in particular, $ctraf_r$ counts both read and write accesses to buffer $a_r$.

In the $\mathrm{task\_trf}(s)$ expression, if task $t_i$ executes in $s$ (thus the stochastic function introduced in Section 4.1 has $f_{t_i}(s) = 1$), then $(1 - \sum_{j=0}^{n_p-1} M_{ij})$ is 1 iff the program data of task $i$ are remotely allocated. The same holds for the internal state. In the $\mathrm{comm\_trf}(s)$ expression we have a contribution if both the source and the destination task execute (thus the stochastic functions introduced in Section 4.1 have $f_{t_i}(s) = f_{t_k}(s) = 1$) and the queue is remotely allocated ($1 - \sum_{j=0}^{n_p-1} C_{rj} = 1$).

As one can see, the considered objective function is a sum of a very large number of terms, possibly depending on pairs of decision variables: for example, the $C$ variables depend on pairs of task mapping decisions through constraints (4) and (5). As the CP propagation mechanism tends to perform poorly on this type of expression, this structure of the objective function was one of the main motivations behind the choice of ILP for the solution of the allocation problem.
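Before turning to the deterministic reformulation, note that for a fixed allocation, trf(s) is easy to evaluate scenario by scenario. The following minimal sketch is our own illustration, where M_loc[i], St_loc[i] and C_loc[r] hold the values of the sums over j in the formulas above (i.e., 1 iff the requirement is locally allocated), and scenario_nodes is the node set of TG(s):

```python
def traffic(scenario_nodes, arcs, M_loc, St_loc, C_loc, mtraf, straf, ctraf):
    """Evaluate trf(s) for one scenario under a fixed 0/1 allocation."""
    # Task-related terms: f_ti(s) selects the tasks executing in s.
    t = sum(mtraf[i] * (1 - M_loc[i]) + straf[i] * (1 - St_loc[i])
            for i in scenario_nodes)
    # Arc-related terms: both endpoints must execute in s.
    t += sum(ctraf[r] * (1 - C_loc[r])
             for r, (i, k) in enumerate(arcs)
             if i in scenario_nodes and k in scenario_nodes)
    return t
```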

7.2.2 Transformation into a deterministic model

In most cases, the minimization of a stochastic functional, such as an expected value, is a very complex operation, since it often requires repeatedly solving a deterministic subproblem [13]. The cost of such a procedure is not affordable for hardware design purposes, since the deterministic subproblem is by itself NP-hard. One of the main contributions of this paper is the way we reduce the bus traffic expected value to a deterministic expression.

Since all tasks have to be assigned a PE and a memory device before running the application, the decision variables do not depend on branch outcomes and the allocation is a stochastic one-stage problem: thus, for a given task-PE assignment, the expected value depends only on the stochastic variables. Intuitively, if we properly weight the bus traffic contributions according to task probabilities and sum over all scenarios, we should be able to get an analytic expression for the expected value. Now, since both the expected value operator and the bus traffic expression are linear, the objective function for a given allocation can be decomposed into task-related and arc-related blocks:

$$E(\mathrm{trf}(M, St, C, s)) = \sum_{s \in S} p(s)\, \mathrm{trf}(M, St, C, s)$$

$$E(\mathrm{trf}(M, St, C, s)) = \sum_{s \in S} p(s) \left[ \sum_{i=0}^{n_t-1} \mathrm{task\_trf}_i(M^i, St^i, s) + \sum_{a_r = (t_i, t_k)} \mathrm{comm\_trf}_r(C^r, s) \right]$$

Since for a given allocation the objective function depends only on the stochastic variables, the contributions of the decision variables are constants with respect to the scenarios. Let them be:

$$KT_i(M^i, St^i) = mtraf_i \left( 1 - \sum_{j=0}^{n_p-1} M_{ij} \right) + straf_i \left( 1 - \sum_{j=0}^{n_p-1} St_{ij} \right) \qquad (7)$$

$$KC_r(C^r) = ctraf_r \left( 1 - \sum_{j=0}^{n_p-1} C_{rj} \right) \qquad (8)$$

where $KT_i(M^i, St^i) = 0$ if task $t_i$ allocates both its computation memory and its internal state on the local device; similarly, $KC_r(C^r) = 0$ if the queue buffer corresponding to arc $arc_r$ is locally allocated. By substituting $KT$ and $KC$ in the formula, we get:

$$E(\mathrm{trf}) = \sum_{s \in S} p(s) \left[ \sum_{i=0}^{n_t-1} f_{t_i}(s)\, KT_i(M^i, St^i) + \sum_{a_r = (t_i, t_k)} f_{t_i}(s)\, f_{t_k}(s)\, KC_r(C^r) \right]$$

This can be rewritten as:

$$E(\mathrm{trf}(M, St, C, s)) = \sum_{i=0}^{n_t-1} KT_i(M^i, St^i) \left( \sum_{s \in S} p(s)\, f_{t_i}(s) \right) + \sum_{a_r = (t_i, t_k)} KC_r(C^r) \left( \sum_{s \in S} p(s)\, f_{t_i}(s)\, f_{t_k}(s) \right)$$

The term $\sum_{s \in S} p(s) f_{t_i}(s)$ is the probability of all scenarios where a given task executes (referred to as $p(t_i)$ in Section 4.1), while $\sum_{s \in S} p(s) f_{t_i}(s) f_{t_k}(s)$ is the probability that both tasks $t_i$ and $t_k$ execute in the same scenario (referred to as $p(t_i \wedge t_k)$ in Section 4.1). Hence:

$$E(\mathrm{trf}) = \sum_{i=0}^{n_t-1} KT_i(M^i, St^i) \cdot p(t_i) + \sum_{arc_r = (t_i, t_k)} KC_r(C^r) \cdot p(t_i \wedge t_k) \qquad (9)$$

To apply the transformation we need both these probabilities; moreover, to achieve an effective overall complexity reduction, they have to be computed in a reasonable time. We developed two polynomial-time algorithms to compute them.
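With $p(t_i)$ and $p(t_i \wedge t_k)$ precomputed, objective (9) becomes an ordinary linear expression. Continuing the illustrative PuLP sketch given after the allocation model in Section 7.2.1 (again our own illustration, not the paper's implementation):

```python
import pulp

def set_expected_traffic_objective(prob, M, St, C, arcs, np_,
                                   mtraf, straf, ctraf, p_task, p_pair):
    """Objective (9): expected bus traffic, with the scenario probabilities
    folded into the constant weights p_task[i] = p(t_i) and
    p_pair[(i, k)] = p(t_i and t_k)."""
    obj = pulp.lpSum(
        p_task[i] * (mtraf[i] * (1 - pulp.lpSum(M[i][j] for j in range(np_)))
                     + straf[i] * (1 - pulp.lpSum(St[i][j] for j in range(np_))))
        for i in range(len(p_task)))
    obj += pulp.lpSum(
        p_pair[(i, k)] * ctraf[r] * (1 - pulp.lpSum(C[r][j] for j in range(np_)))
        for r, (i, k) in enumerate(arcs))
    prob += obj  # adding a bare expression sets the LpProblem objective
```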

7.2.3 Probability of a node

All the developed algorithms are based on three data structures derived from the CTG, namely the activation set of a node, the exclusion matrix and the sequence matrix. In Figure 6A we show an example of a CTG on the left and the related data structures.

Definition 2 The Activation Set $AS(t_i)$ of a Task Graph node $t_i$ is the set of condition outcomes $Out_h$ on all paths from the root node to $t_i$.

For instance, the activation set of node $n$ in Figure 6A contains outcomes $a$, $b$ and $\neg c$ from one path and $a$, $\neg b$ and $d$ from the second path. The following postulate holds:

Postulate 1 $AS(t_i)$ contains all conditions on which the execution of $t_i$ depends; i.e., outcomes in $AS(t_i)$ are sufficient for triggering the execution of $t_i$.

This is a direct consequence of the graph being acyclic.

Definition 3 The Exclusion Matrix (EM) of a CTG is a binary $n_o \times n_o$ matrix (where $n_o$ is the number of condition outcomes) such that $EM_{hk} = 1$ iff outcomes $Out_h$ and $Out_k$ label outgoing arcs of the same branch.

For instance, $EM_{b\bar{b}} = 1$, since the respective arcs originate at the same branch node. Note that if $EM_{hk} = 1$ and $h \neq k$, then $Out_h$ and $Out_k$ are mutually exclusive.

Definition 4 The Sequence Matrix (SM) of a CTG is a binary $n_o \times n_o$ matrix such that $SM_{hk} = 1$ (with $h \neq k$) iff, in order for some node to execute, either $Out_h$ requires $Out_k$ or $Out_k$ requires $Out_h$. More formally, $SM_{hk} = 1$ if and only if $\exists t_i \in T$ such that:

1) either $\forall s \mid t_i \in TG(s): Out_h \in s \Rightarrow Out_k \in s$,
2) or $\forall s \mid t_i \in TG(s): Out_k \in s \Rightarrow Out_h \in s$.

For instance, $SM_{a\bar{b}} = 1$, since both outcomes are needed for node $n$ in Figure 6A to execute, while $SM_{\bar{b}c} = 1$ since $\bar{b}$ requires $c$ to trigger the execution of node $t_{21}$ in Figure 2. Intuitively, EM and SM give structural information about the whole graph, while $AS(t_i)$ can be used as a kind of mask to focus on the subpart of EM and SM related to a single task, formally on the cells $EM_{hk}$ and $SM_{hk}$ such that $Out_h, Out_k \in AS(t_i)$.
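As discussed after Figure 6, the activation sets can be built with a single visit of the graph in topological order. The following Python sketch is our own illustration of that computation; the encoding of nodes, arcs and outcome labels is assumed.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def activation_sets(nodes, arcs, arc_outcome):
    """AS(t_i) for every node, built in one topological visit.

    arcs: list of (i, j) pairs; arc_outcome[(i, j)] is the outcome labeling
    arc (i, j) when it leaves a branch node, or None when it leaves a fork."""
    preds = {n: [] for n in nodes}
    for (i, j) in arcs:
        preds[j].append(i)
    AS = {n: set() for n in nodes}
    for n in TopologicalSorter(preds).static_order():
        for i in preds[n]:
            AS[n] |= AS[i]              # outcomes on all paths reaching i
            out = arc_outcome[(i, n)]
            if out is not None:
                AS[n].add(out)          # plus the outcome on arc (i, n)
    return AS
```

The exclusion matrix, by contrast, can be read off directly from the outgoing arcs of each branch node, as noted in the text.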

Fig. 6 (A) An example of the three data structures; (B) Computation of the probability of a node

21

All these data structures can be extracted from the graph in polynomial time; in particular, the exclusion matrix comes directly from the graph structure, while all activation sets in the graph can be incrementally computed by visiting each node and arc once in topological order (thus in time O(max(nt, na))). Computing SMhk is not trivial at all, but can be done in polynomial time: a proof in the form of basic algorithmic steps is reported in Appendix A.

There exists an interesting connection between the AS, EM, SM data structures and the so-called activation event of a node. In particular, let S be the set of all possible scenarios (assignments of compatible outcomes to conditions); let S(ti) be the set of scenarios s corresponding to the deterministic task graphs TG(s) where task ti executes. The elements of S(ti) are all the sets of outcomes s sufficient for ti to be in TG(s). Since the execution of ti only depends on the conditions found on the paths from the root node to ti (and hence in AS(ti)), we can restrict the attention to outcomes found on such paths and get all the necessary and sufficient sets for the execution of ti. We can then define the activation event of ti:

Definition 5 The Activation Event AE(ti) of a task ti is a logical formula in Disjunctive Normal Form such that:

    AE(ti) = ⋁_{s ∈ S(ti)} ⋀_{Outh ∈ s ∩ AS(ti)} Outh

By definition, each conjunction of outcomes in AE(ti) is necessary and sufficient for ti to execute; hence ti runs if and only if AE(ti) is true. Observe that some of the conjunctions of outcomes in AE(ti) turn out to be the same, namely when the corresponding scenarios s do not differ on any outcome in AS(ti). The following theorem shows that the activation event of ti can be derived from AS(ti) and SM.

Theorem 1 Let AS(ti) be the activation set of a task ti and let C(AS(ti)) be the set of conditions cj such that two or more outcomes of cj are in AS(ti). Then the activation event AE(ti) can be derived by:

1. Considering all consistent assignments of outcomes in AS(ti) to conditions in C(AS(ti)); an assignment of outcomes is consistent if it is part of at least one scenario. Let us refer to the assigned outcomes as Out.
2. Extending each Out with the outcomes Outh such that SMhk = 1 for every Outk ∈ Out. Formally:

    Ext(Out) = Out ∪ {Outh ∈ AS(ti) | SMhk = 1 ∀Outk ∈ Out}

Then, it holds:

    AE(ti) = ⋁_{all consistent Out} ⋀_{Outh ∈ Ext(Out)} Outh

The importance of Theorem 1 comes from the fact that it combines global information from the sequence matrix and local information from the activation set to get a condition for the execution of ti. For instance, for node n in Figure 6A we have AS(n) = {a, b, ¬b, ¬c, d}, and C(AS(n)) contains only the condition associated with outcomes b and ¬b; the consistent assignments are Out = {b} and Out = {¬b}, extended to Ext({b}) = {a, b, ¬c} and Ext({¬b}) = {a, ¬b, d}, so that AE(n) = (a ∧ b ∧ ¬c) ∨ (a ∧ ¬b ∧ d). The formal proof is rather complex and is reported in Appendix B.


Once the data structures are available, we can determine the existence probability of a node ti using Algorithm 1 (A1), which is used to compute p(ti) in equation (9). In the algorithm, the notation SMh stands for the set of condition outcomes in conjunctive relation with a given outcome Outh; formally, SMh = {Outk ∈ P | SMhk = 1}; similarly, EMh = {Outk | EMhk = 1} is the set of condition outcomes originating at the same branch as Outh. Moreover, the algorithm requires outcome indices to be sorted according to the graph topology; that is, if Outh is met before Outk during a topological visit of the CTG, then h < k. Since the CTG is acyclic, it is always possible to find a labeling of the outcomes compliant with this property.

Intuitively, Algorithm A1 takes as input a set σ of condition outcomes (initially this is the activation set of the target node ti) and works in a recursive fashion; at each step:

1. if no condition cj exists such that two or more outcomes of cj are in σ, then C(σ) = Out = ∅ and all outcomes are needed for ti to execute. In this case the probability is Π_{Outh ∈ σ} p(Outh).
2. if C(σ) ≠ ∅, then we can choose a condition in C(σ) and partition the set according to the corresponding outcomes; by repeating the process several times, we in practice enumerate all meaningful combinations of outcomes Out for conditions in C(σ). By progressively identifying the corresponding set Ext(Out), we end up with a set of necessary and sufficient outcomes for the execution of ti; for each such set the probability can be computed as in case 1. Progressive computation avoids the exponential cost of a pure enumeration.

Algorithm 1 Probability of an activation set (A1) — probability of a node or an arc

1: let σ be the input set for the algorithm; initially σ is the activation set whose probability we want to compute
2: find the first outcome Outh ∈ σ such that |EMh ∩ σ| > 1
3: if no such outcome exists then
4:   return p = Π_{Outh ∈ σ} p(Outh) if σ ≠ ∅, p = 1 otherwise
5: else
6:   set B = EMh ∩ σ, σ = σ \ B
7: end if
8: compute set C = ⋂_{Outh ∈ B} (σ ∩ SMh)
9: compute set R = ⋂_{Outh ∈ B} (σ \ SMh)
10: set p = 0
11: for each outcome Outh ∈ B do
12:   set p = p + A1(((σ ∩ SMh) \ C) ∪ {Outh})
13: end for
14: return p = p · A1(C) · A1(R)
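For concreteness, the following C++ sketch is a direct transcription of A1 under the same integer-outcome encoding used in the previous sketch (prob[h] = p(Outh)); it is meant to illustrate the recursion, not to reproduce the authors' actual implementation:

#include <set>
#include <vector>

// Context for the computation: outcome probabilities and the EM/SM matrices,
// with outcome indices assumed to be sorted topologically.
struct A1Ctx {
    std::vector<double> prob;             // prob[h] = p(Outh)
    std::vector<std::vector<bool>> EM;    // exclusion matrix
    std::vector<std::vector<bool>> SM;    // sequence matrix
};

double A1(std::set<int> sigma, const A1Ctx &ctx) {
    // line 2: first outcome whose branch has more than one outcome in sigma
    int sel = -1;
    for (int h : sigma) {
        int cnt = 0;                      // counts Outh plus its same-branch
        for (int k : sigma)               // siblings present in sigma
            if (k == h || ctx.EM[h][k]) ++cnt;
        if (cnt > 1) { sel = h; break; }
    }
    if (sel < 0) {                        // lines 3-4: plain product
        double p = 1.0;                   // (p = 1 for the empty set)
        for (int h : sigma) p *= ctx.prob[h];
        return p;
    }
    std::set<int> B, rest;                // line 6: B = EMh ∩ σ, σ = σ \ B
    for (int k : sigma)
        ((k == sel || ctx.EM[sel][k]) ? B : rest).insert(k);
    std::set<int> C, R;                   // lines 8-9: required with all /
    for (int k : rest) {                  // with none of the outcomes in B
        bool inAll = true, inNone = true;
        for (int h : B) {
            if (ctx.SM[h][k]) inNone = false;
            else              inAll  = false;
        }
        if (inAll) C.insert(k);
        if (inNone) R.insert(k);
    }
    double p = 0.0;                       // lines 10-13: the mutually
    for (int h : B) {                     // exclusive alternatives are summed
        std::set<int> part{h};            // ((σ ∩ SMh) \ C) ∪ {Outh}
        for (int k : rest)
            if (ctx.SM[h][k] && !C.count(k)) part.insert(k);
        p += A1(part, ctx);
    }
    return p * A1(C, ctx) * A1(R, ctx);   // line 14
}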

A formal discussion of the soundness of A1 is reported in Appendix C. In detail, at line 2 the algorithm checks the set σ for the presence of a set of condition outcomes originating at the same branch, i.e. such that |EMh ∩ σ| > 1. Note that the information from the global EM matrix is restricted to the local scope of task ti by performing an intersection with σ. The minimum index outcome Outh is chosen. If no such set of outcomes exists, at lines 3-4 Algorithm A1 computes the set probability as specified in point 1 of the enumeration above; if σ is empty the returned probability is 1, so that for example the probability of the root node is A1(AS(root)) = A1(∅) = 1.


Conversely, if such a set EMh ∩ σ of mutually exclusive outcomes is found, it is given the name B at line 6. Next, the current set σ is partitioned into:

1. a subset C containing the outcomes in σ required together with all Outh ∈ B (formally ⋂_{Outh ∈ B} (σ ∩ SMh), see line 8);
2. a subset R containing the outcomes in σ not required together with any Outh ∈ B (formally ⋂_{Outh ∈ B} (σ \ SMh), see line 9);
3. a subset for each Outh ∈ B, containing Outh itself, plus the outcomes in σ exclusively required together with Outh (i.e. not previously included in C: formally (σ ∩ SMh) \ C, see line 12). Note that these subsets are never empty, as they always contain at least Outh.

The latter subsets are mutually exclusive: their probability is computed recursively and summed up at lines 11-13. The probability of subset C has to be multiplied by the value computed so far (as the contained outcomes are required by all Outh ∈ B). The set R, appearing in the algorithm, contains the outcomes Outk ∈ σ such that SMhk = 0 for all Outh in B: those outcomes trigger the execution of ti, but are independent of the chosen set B. For this reason the probability of R has to be multiplied by the value computed so far (see line 14). Note that the R set is always empty, unless A1 is used to compute the probability of a coexistence set (see next subsection).

Let us follow the algorithm on the graph in figure 6B, where we have to compute the probability of node n, whose activation set is AS(n) = {a, b, ¬b, ¬c, d}. First, the algorithm looks for a group of mutually exclusive condition outcomes in σ (line 2): this is done by finding the minimum index outcome Outh such that another mutually exclusive outcome Outk ∈ EMh exists in σ. In the example b is selected, as ¬b is part of AS(n) as well.

Representing sets as bit vectors, line 2 has complexity O(no · bf), where no is the number of outcomes in P and bf (branching factor) denotes the maximum number of outgoing arcs over all branch nodes; lines 4 and 6 are completed in O(no). Sets C and R are computed in O(no · bf). Lines 11-13 altogether have complexity O(no · bf). The whole algorithm is repeated O(log(no)) times. Hence the overall complexity is O(no · log(no) · bf).

7.2.4 Coexistence probability of a pair of nodes

We have to compute the probability that a pair of tasks execute in the same scenario, so as to compute p(ti ∧ tj) in equation (9). Given a pair of nodes ti and tj, we can determine a common activation set (coexistence set (CS)).

Definition 6 The Coexistence Set of two nodes ti, tj is the set of all outcomes responsible for triggering the execution of both ti and tj.

One can get CS(ti, tj) starting from AS(ti) ∪ AS(tj) and removing the outcomes leading to the execution of ti but preventing the execution of tj (and vice-versa). This can be better specified by introducing the concept of Exclusion Set:

Definition 7 The Exclusion Set EX(σ) of a set of outcomes σ is the set of outcomes surely excluded by those in σ; in particular, Outh ∈ EX(σ) if Outh is not in σ and it is mutually exclusive with another outcome Outk ∈ σ; more formally:

    EX(σ) = {Outh ∈ P \ σ | EMh ∩ σ ≠ ∅}


where P is the set of outcomes in the CTG definition; EX(σ) can be computed in O(no²) (O(no) to consider all possible outcomes, O(no) to get EMh ∩ σ). The rationale is that, in the hypothesis that σ represents the outcomes having a part in the execution of some node, EX(σ) contains the set of outcomes not compatible with the execution of such a node; in practice, if an outcome Outh is in σ, then σ excludes from being true any other outcome Outk such that EMhk = 1, unless Outk is in σ as well. We can now specify a formal property of CS(ti, tj):

Theorem 2 A subset of outcomes σ ⊆ AS(ti) sufficient to trigger the execution of ti and a subset of outcomes ρ ⊆ AS(tj) sufficient to trigger the execution of tj are in CS(ti, tj) if and only if ρ contains no outcome in EX(σ) and σ contains no outcome in EX(ρ).

The statement is a direct consequence of Definition 7. Algorithm A2 directly exploits Theorem 2 to compute CS(ti, tj), given AS(ti) and AS(tj). In particular, A2 builds the sets σ in Theorem 2 by identifying "backward paths" from ti to the root node, and the sets ρ by identifying "forward paths" from the root to tj. Note that at least one among the sets AS(ti) and AS(tj) is supposed to be non-empty; if both AS(ti) and AS(tj) are empty, then ti and tj always execute and computing the coexistence set makes no sense.

Algorithm 2 Coexistence set determination (A2)

1: if AS(ti) = ∅ (or AS(tj) = ∅) then
2:   CS = AS(tj) (or CS = AS(ti))
3: else
4:   CS = ∅
5:   while there are non-processed outcomes in AS(ti) do
6:     let Outh be the first outcome of a condition in C(AS(ti)) if C(AS(ti)) ≠ ∅; otherwise, let Outh be the first non-processed outcome
7:     compute set σ = AS(ti) ∩ SMh
8:     compute set X = AS(tj) ∩ EX(σ)
9:     compute set C = AS(tj) ∩ ⋃_{Outk ∈ X} SMk
10:    compute set R = AS(tj) ∩ ⋃_{Outk ∈ AS(tj)\C} SMk
11:    set D = C \ R (outcomes to delete)
12:    if AS(tj) is not a subset of D then
13:      set CS = CS ∪ σ ∪ (AS(tj) \ D)
14:    end if
15:  end while
16: end if
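Algorithm A2 relies at line 8 on the exclusion set of Definition 7. A sketch of its computation, under the same integer-outcome encoding as the previous sketches and matching the O(no²) bound stated above:

#include <set>
#include <vector>

// EX(σ): outcomes outside σ that are mutually exclusive with some outcome
// in σ (Definition 7).
std::set<int> exclusionSet(const std::set<int> &sigma,
                           const std::vector<std::vector<bool>> &EM) {
    std::set<int> ex;
    const int no = (int)EM.size();
    for (int h = 0; h < no; ++h) {        // O(no) candidate outcomes
        if (sigma.count(h)) continue;     // Outh must not belong to σ
        for (int k : sigma)               // O(no) test of EMh ∩ σ ≠ ∅
            if (EM[h][k]) { ex.insert(h); break; }
    }
    return ex;
}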

The algorithm starts by selecting at line 6 a condition outcome Outh in AS(ti) (for instance a in part 1 of figure 7). The outcome Outh is either such that another outcome of the same condition ck exists in AS(ti) (hence ck ∈ C(AS(ti)), with the notation of Theorem 1), or just the first non-processed outcome if no such condition exists. In general Outh alone is not sufficient for the execution of ti, but requires other outcomes as well. Due to Theorem 1, given an outcome Outh ∈ AS(ti), by building a set σ with Outh and all Outk ∈ AS(ti) such that SMhk = 1, we get a set of outcomes corresponding to (possibly more than) one conjunction in AE(ti); this is therefore sufficient to trigger the execution of ti. Algorithm A2 finds such a set σ, representing a group of backward paths, at line 7 (set σ in part 2 of figure 7). The algorithm then identifies all forward paths in AS(tj) not compatible with the backward paths σ; this is done by:

Fig. 7 Coexistence set computation (the four steps of algorithm A2 on an example pair of nodes ni, nj, together with the corresponding sequence matrix SM)

1. finding the elements in AS(tj) not compatible with σ (set X at line 8). In part 3 of figure 7 the only outcome in the intersection is ¬a (crossed arc);
2. finding a set of "candidate outcomes" for elimination (set C at line 9); this contains all outcomes in X and the outcomes in AS(tj) requiring X to trigger the execution of tj (SMk ∩ AS(tj) with Outk ∈ X);
3. finding the set of outcomes in AS(tj) requiring some non-candidate outcome (Outk ∈ AS(tj) \ C) to trigger the execution of tj (set R at line 10). Those are indeed compatible with σ (for instance outcome f, in sequence with ¬b, in part 4 of figure 7).

The set of non-compatible forward paths is identified by the set of outcomes D = C \ R. The outcomes left in AS(tj) identify a set of forward paths we are interested in. If AS(tj) is not completely wiped out, both the backward paths σ and the forward paths (AS(tj) \ D) are added to CS(ti, tj) (line 13). The algorithm goes on until all outcomes in AS(ti) are processed.

If there is no path from ti to tj (i.e. the coexistence set is empty) the two nodes are mutually exclusive and their coexistence probability is 0. Otherwise, the probability of a non-empty coexistence set can be computed once again by means of algorithm A1 (a proof is reported in Appendix D). Overall, algorithm A2 performs O(no) iterations (where no is the number of outcomes); computing σ has complexity O(no); computing set X takes O(no²) to get the exclusion set and O(no) for the intersection with AS(tj); getting C and R takes O(no²) and D takes O(no); lines 12 and 13 have time complexity O(no). Therefore the total time complexity is O(no³).

To conclude, with A1 and A2 we are able to compute the existence probability of a single node and the coexistence probability of a pair of nodes. Since the algorithm complexities are polynomial, the reduction of the bus traffic objective to a deterministic expression can be done in polynomial time.


7.3 Scheduling Model

The scheduling subproblem has been solved by means of Constraint Programming. Since the objective function depends only on the allocation of tasks and memory requirements, scheduling is just a feasibility problem. We decided to provide a unique worst case schedule, forcing each task to execute after all its predecessors in any scenario. Tasks using the same resources can overlap if they never appear in the same runtime scenario (i.e. they are mutually exclusive).

As already mentioned in section 4.2, each task is modeled as a set of non-preemptive and atomic activities. Every activity is constrained to execute between a release time (rt) and a deadline (dl); for each activity act a start and an end variable are defined such that:

    start(act) ≥ rt(act)
    end(act) ≤ dl(act)
    end(act) = start(act) + DUR(act)

The minimum element in the domain of the start variable is referred to as the earliest start time (EST) of the activity, and the maximum of the domain as the latest start time (LST); the earliest end time (EET) and the latest end time (LET) are defined similarly w.r.t. the end variable. As explained in section 3, we model processing elements as unary resources, and the bus as a discrete resource with capacity equal to its bandwidth.

Search is performed on the start time variables with a standard and very efficient "schedule or postpone" strategy (see [5]). Namely, at each search node the activity act with the lowest earliest start time t is selected and scheduled at t; if this leads to a fail, act is postponed and never selected again until its earliest start time changes due to propagation. Branching on possible activity orderings could be another option; note however that providing a total order for each group of activities running on the same processor is not sufficient, as possible conflicts on the bus have to be resolved as well. Precedence Constraint Posting approaches (see [36]) have been devised to tackle this issue and provide a schedule as a partial order of activities; note however that, on scheduling problems with fixed durations such as the one at hand, those approaches tend to have lower average case efficiency compared to scheduling over time; we therefore adopted the latter option.

7.3.1 Precedence relations

Tasks are linked by precedence relations due to data communication, while other precedence relations result from the decomposition of each task into many activities. Both types of relations are modeled as constraints on the start and end variables. In particular, given two activities acti and actj, both strict and loose precedence constraints are possible, respectively enforcing end(acti) ≤ start(actj) (actj executes after acti) and end(acti) = start(actj) (actj executes immediately after acti). The number and type of precedence constraints used depends on the type of the involved tasks (or/and, branch/fork) and follows rather complex schemata, in order to accurately model the actual runtime behavior of a task; an overview of all possible schemata is given in figure 8. In the picture the black arrows represent strict precedence relations, while the gray dashed arrows are loose precedence relations.
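As an illustration of the time variables introduced above, the following C++ sketch renders an activity and its bounds as a plain data structure (a toy rendering only; the actual implementation relies on the ILOG Scheduler activity objects):

#include <algorithm>
#include <cstdint>

// Toy activity model: EST/LST are the bounds of the start variable; the end
// variable is implied by the fixed duration.
struct Activity {
    int64_t est, lst;                          // bounds of the start variable
    int64_t dur;                               // fixed duration DUR(act)
    int64_t EST() const { return est; }        // earliest start time
    int64_t LST() const { return lst; }        // latest start time
    int64_t EET() const { return est + dur; }  // earliest end time
    int64_t LET() const { return lst + dur; }  // latest end time
    // release time and deadline are enforced by clamping the start bounds
    void applyWindow(int64_t rt, int64_t dl) {
        est = std::max(est, rt);
        lst = std::min(lst, dl - dur);
    }
    // the activity has an obligatory region when LST ≤ EET (see section 7.3.2)
    bool hasObligatoryRegion() const { return LST() <= EET(); }
};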


Fig. 8 Task decomposition schema

In case a task has a single ingoing arc (input queue), at run time the execution phase starts immediately after the only reading activity; this is captured by a strict precedence relation (figure 8A), and in this case the reading and execution activities are in fact glued together in order to improve the accuracy of the model. If more than one ingoing arc is present, at run time the task can suspend on each input queue (if data are not yet available), but the execution phase starts with no delay when the last reading operation is over; this is modeled by introducing a "cover" activity, which starts with the first reading activity and ends with the last one: the execution phase starts immediately after this fake activity. This enables suspension between each pair of reading activities, and leaves the order in which input buffers are read to be decided at search time.

The execution phase consists of the execution activity alone if the task has no state; otherwise the read state, execution and write state activities are linked by loose precedence relations (figure 8B); this models the fact that at runtime the read state/write state activities can be delayed by heavy bus traffic. Note however that the processor is not released if such a delay occurs: a cover activity requiring a PE is used to represent this behavior.

If a single outgoing arc is present, the corresponding write activity at runtime starts immediately after the execution phase (see figure 8C); once again, in this case the two activities are actually glued together. If the task has more than one outgoing arc, the adopted schema depends on whether the task is a branch or a fork. All the writing operations of a branch node are mutually exclusive: therefore they all start immediately after the execution phase, since they never appear in the same scenario. Writing activities of a fork node must all be


performed after the execution phase with no suspension, in an unspecified order: a cover activity of fixed duration, equal to the sum of the durations of all writing activities, constrains their sequence to start immediately after the execution phase and leaves the order to be decided at search time. Finally, precedence relations due to data communication are modeled as loose precedence constraints between pairs of writing and reading activities corresponding to the same arc (see figure 8D). Once again we stress that such a complex precedence schema was adopted in order to provide an accurate model of the task runtime behavior, and was found to be crucial to achieve a low makespan prediction error (see section 8.3).

7.3.2 Resource constraints

Both the processing elements and the bus are modeled as limited capacity resources, whose limit cannot be exceeded by overlapping non mutually exclusive tasks during the execution. On the contrary, mutually exclusive tasks can access the same resource at the same time without competition, since they never appear in the same scenario. Special resource constraints guarantee that these properties hold in the schedule.

In particular, the processing elements are unary resources (i.e. with unary capacity): tasks requiring the same PE therefore cannot overlap in time at all. We model them by means of a simple disjunctive constraint proposed in [24], which enforces, for every two activities acti, actj requiring the same PE:

    p(task(acti) ∧ task(actj)) = 0 ∨ end(acti) ≤ start(actj) ∨ end(actj) ≤ start(acti)

where task(acti) = ti for the queue/state reading, execution and queue/state writing activities of task ti. In practice, acti and actj cannot overlap in time, unless they never appear in the same scenario (p(task(acti) ∧ task(actj)) = 0). In the problem we face, each activity acti related to task tj requires the PE tj is mapped to. As an exception, in the execution phase of tasks with state the PE is required by a cover activity (see figure 8B): this allows the state reading, execution and state writing activities to stretch without releasing the processor.

The bus, as in [28], is modeled as a cumulative resource with capacity equal to its bandwidth, according to the so-called "additive model", which keeps the error below 10% as long as bandwidth usage is under 60% of the real capacity. Each activity requires an amount of bus bandwidth depending on the size of the data to exchange and on its duration.

A family of filtering algorithms for cumulative resource constraints is based on timetables, data structures storing the worst case usage profile of a resource over time [40]. While timetables for traditional resources are relatively simple and very efficient, computing the worst case usage profile in presence of alternative activities is not trivial at all. Suppose for instance we have the CTG in figure 9A, where for the sake of simplicity each task corresponds to an activity; tasks t0, . . . , t4 and t6 have already been scheduled: their start times and durations are reported in figure 9B; the bus has bandwidth 3, and the bandwidth requirement of each task is reported next to it in the graph. Tasks t5 and t7 have not yet been scheduled; t5 is present only in scenario ¬a, where the bus usage profile is the first one reported in figure 9B; on the other hand,


t7 is present only in scenario a, b, where the bus usage profile is the latter in figure 9B. Therefore the resource view at a given time depends on the activity we are considering. In case an activity is present in more than one scenario, the worst case has to be considered.
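Going back to the disjunctive constraint on processing elements introduced above, its feasibility test can be sketched as follows (coexProb stands for p(task(acti) ∧ task(actj)), as computed by algorithms A1/A2; the plain-integer bounds are an assumption of this illustration):

#include <cstdint>

// Two activities on the same PE must admit at least one ordering, unless
// their tasks never appear in the same scenario.
bool disjunctiveFeasible(int64_t estA, int64_t lstA, int64_t durA,
                         int64_t estB, int64_t lstB, int64_t durB,
                         double coexProb) {
    if (coexProb == 0.0) return true;     // mutually exclusive: may overlap
    bool aFirst = estA + durA <= lstB;    // end(acti) ≤ start(actj) possible
    bool bFirst = estB + durB <= lstA;    // end(actj) ≤ start(acti) possible
    return aFirst || bFirst;
}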

Fig. 9 Capacity of a cumulative resource on a CTG

In order to model the bus we introduce a new global timetable constraint for cumulative resources and conditional tasks in the non-preemptive case. The global constraint keeps a list with the obligatory region of each activity; formally, if LST(acti) ≤ EET(acti), then the activity acti is said to have an obligatory region, meaning that a time region exists where the activity must execute. In the following, if an activity acti has a non-empty obligatory region, LST(acti) is referred to as the entry point of the obligatory region, while EET(acti) is referred to as the exit point. A special timetable data structure is used to store the resource usage over time; for a discrete resource with conditional activities, the usage at time t is defined as the highest usage over all scenarios, due to tasks whose obligatory region overlaps time t (LST(acti) ≤ t < EET(acti)). Note that, for this reason, the resource usage can only increase at an entry point, or decrease at an exit point. In particular, to ensure consistency it is sufficient to check the entry points of all activities.

Whenever the obligatory region of an activity changes, all other activities acti are processed by means of the filtering procedure described in Algorithm 3. In the description, cap is the bus bandwidth, req(acti) is the bandwidth requirement of activity acti and usage(t) is a function returning the resource usage at time t. The algorithm scans the interval [EST(acti), finish), checking whether enough resource capacity is available for a time period long enough to enable scheduling acti. First the resource usage is checked at time t = EST(acti) (line 2). If the resource is not overused, t is a candidate time to schedule acti (candStart = t) and the algorithm starts to check the following entry points (good = true); otherwise, no candidate scheduling time is defined and an exit point must be found where the resource usage is low enough. The main algorithm loop has therefore two terminating conditions (line 11):


if a candidate start time has been found (good = true), the process stops when the resource usage has been low enough for a period long enough (t ≥ finish); if instead good = false, we stop when there is no possible start time left for acti (t > LST(acti)). In lines 12 to 16 the resource usage at time t is computed, adding the requirement of the current activity if t is outside its obligatory region. If a resource overuse is detected when scanning entry points (line 17), then the algorithm starts to look for another candidate start time. If the resource capacity is sufficient to schedule acti when searching for a candidate start time (line 19), then the algorithm starts to check entry points and the finish time is updated (line 21). Finally, the algorithm moves to the next entry/exit point (lines 23 to 27). Once the process is over, either a lower bound has been computed for the start of acti (line 30), or acti cannot be scheduled (line 32).

Algorithm 3 Timetable filtering with alternative activities (A3)

1: t = EST(acti), finish = EET(acti)
2: if usage(t) + req(acti) ≤ cap then
3:   good = true
4:   candStart = t
5:   t = next entry point
6: else
7:   good = false
8:   candStart = ∞
9:   t = next exit point
10: end if
11: while ¬[(good = false ∧ t > LST(acti)) ∨ (good = true ∧ t ≥ finish)] do
12:   if t < LST(acti) then
13:     usg = usage(t) + req(acti)
14:   else
15:     usg = usage(t)
16:   end if
17:   if good = true ∧ usg > cap then
18:     good = false, candStart = ∞
19:   else if good = false ∧ usg ≤ cap then
20:     good = true, candStart = t
21:     finish = max(finish, t + DUR(acti))
22:   end if
23:   if good = true then
24:     t = next entry point
25:   else
26:     t = next exit point
27:   end if
28: end while
29: if good = true then
30:   EST(acti) = candStart
31: else
32:   fail
33: end if
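The non-conditional core of the timetable can be sketched in a few lines: the worst case usage profile is a step function that increases only at entry points and decreases only at exit points. The sketch below builds it for ordinary (non-conditional) activities; the conditional version described in the text additionally maximizes over scenarios, using the BFG data structure discussed next:

#include <map>
#include <vector>

// One activity contributes its demand over its obligatory region [LST, EET).
struct ObligatoryRegion { long lst, eet, req; };

// Returns the resource usage level starting at each breakpoint.
std::map<long, long> usageProfile(const std::vector<ObligatoryRegion> &acts) {
    std::map<long, long> delta;            // usage change per time point
    for (const ObligatoryRegion &a : acts)
        if (a.lst < a.eet) {               // non-degenerate obligatory region
            delta[a.lst] += a.req;         // entry point: usage may increase
            delta[a.eet] -= a.req;         // exit point: usage may decrease
        }
    long level = 0;
    std::map<long, long> profile;
    for (auto &kv : delta) { level += kv.second; profile[kv.first] = level; }
    return profile;
}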

Computing the usage(t) function requires finding the heaviest set of non-alternative activities using the resource; if only pairwise mutual exclusion relations are considered, as in the unary resource case, this amounts to solving the (NP-complete) maximum weight clique problem on the compatibility graph. Efficient resource usage computation is enabled here by a specific data structure named Branch Fork Graph (BFG), introduced in [27]. The BFG takes advantage of the structure and properties of CTGs to


allow resource usage computation in polynomial time O(nb + nc), provided the graph satisfies a condition named Control Flow Uniqueness. Algorithm 3 considers the entry/exit points of O(nact) activities, and at each step the resource usage has to be computed once; therefore the overall complexity is O(nact(no + |TB|)), where nact is the number of activities, |TB| the number of branches and no the number of condition outcomes. The algorithm can be easily extended to update LET(acti) as well.

7.3.3 Control Flow Uniqueness

We are interested in conditional graphs satisfying Control Flow Uniqueness (CFU), a condition introduced in [31]. CFU is satisfied if each "and" node has a main ingoing arc arci, such that in every scenario where arci is active, all the other ingoing arcs are active as well. In practice, CFU requires each and-node to be triggered by a single "main" predecessor. For example in figure 10A, task t5 is sufficient to trigger the execution of t8 (since t7 executes in all scenarios) and thus CFU holds. By contrast, in figure 10B, neither t4 nor t5 alone is sufficient to activate t7 and CFU is not satisfied.
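CFU can also be stated operationally. The following brute-force sketch tests it for a single and-node over an explicit list of scenarios (activeIn[a] is assumed to hold the set of scenario ids in which ingoing arc a is active; this enumeration-based encoding is for illustration only and is not how the property would be checked on large graphs):

#include <set>
#include <vector>

// An and-node satisfies CFU iff some "main" ingoing arc exists whose
// activation implies the activation of all the other ingoing arcs.
bool satisfiesCFU(const std::vector<std::set<int>> &activeIn) {
    for (size_t main = 0; main < activeIn.size(); ++main) {
        bool ok = true;
        for (size_t other = 0; other < activeIn.size() && ok; ++other) {
            if (other == main) continue;
            for (int s : activeIn[main])        // every scenario activating
                if (!activeIn[other].count(s)) { // the main arc must activate
                    ok = false;                  // the other arc as well
                    break;
                }
        }
        if (ok) return true;                     // found a main ingoing arc
    }
    return false;
}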

Fig. 10 (A) a CTG which satisfies CFU; (B) a CTG which does not satisfy CFU

In many practical cases CFU is not a restrictive assumption: for example, when the graph results from the parsing of a computer program written in a high level language (such as C++, Java, C#), CFU is naturally enforced by the scope rules of the language. Moreover, CFU is a weaker condition than requiring structured graphs [53], that is, graphs with a single collector node for each conditional branch: in particular, CFU allows for graphs with multiple "tail" tasks (with no successor).

7.4 Benders cuts and subproblem relaxation

Each time the master problem solution is not feasible for the scheduling subproblem, a cut is generated in order to forbid such a solution; moreover, all solutions obtained by permutation of PEs are forbidden as well. In the literature, such a cut is usually referred to as a "nogood". Although using nogoods is sufficient for the process to converge, the pace of such convergence is very slow and in practice stronger cuts are needed. In the method we propose, such cuts are generated by solving relaxed scheduling problems where the resource constraints related to all PEs but one are removed; if there is still no solution, the set of


tasks allocated to that PE cannot be allocated as a whole to any PE. More formally, let τ be the identified set of tasks:

    ∀j = 0, . . . , np − 1 :  Σ_{ti ∈ τ} Tij − Σ_{ti ∉ τ} Tij ≤ |τ| − 1

where we ensure that the set of tasks τ is not allocated to PE j (Σ_{ti ∈ τ} Tij must be less than |τ|), unless some task not in τ is allocated on PE j, as this may allow local allocation of some communication buffer, resulting in shorter durations of read/write activities. The cut is very effective, but the relaxed subproblem it requires to solve is NP-hard as well; note however that in practice this is much easier than solving the allocation subproblem one more time: it is therefore worthwhile to generate the strong cuts if at least one iteration can be spared. The impact and the cost of the cuts have been experimentally evaluated in section 8.2.

Second, a common problem with Logic-based Benders Decomposition approaches is that decoupling makes the master problem somehow blind with regard to the subproblem: as an example, in the considered system the provided allocation could be trivially non-schedulable. This issue can be solved by introducing in the master problem a subproblem relaxation [18]. In our model, the relaxation consists of two types of constraints, respectively preventing the cumulative length of a path in the graph and the total duration of all tasks allocated on a single PE from exceeding the deadline. The constraint formulation follows:

1. Path based constraints. For each sequence of communicating tasks (path) π = ti0, ti1, . . .:

    Σ_{j=0}^{np−1} [ Σ_{ti ∈ π} durij(Mij, Stij) + Σ_{th,tk ∈ π, ar=(th,tk)} ( durrj^rd(Crj) + durrj^wr(Crj) ) ] ≤ deadline    (9)

That is, memory devices cannot be allocated in such a way that the total duration of a path is greater than the deadline. The linear functions durij(Mij, Stij) and durrj^rd(Crj), durrj^wr(Crj) represent the durations of all the task/arc related activities w.r.t. a specific PE j; formally:

    durij(Mij, Stij) = Di^e · Tij − (Di^e − di^e)(Tij − Mij) + Di^rs (Tij − Stij) + Di^ws (Tij − Stij)

where Di^e, di^e are the durations of the execution activity with remote and local memory allocation, and Di^rs, Di^ws are the state reading/writing durations with remote state allocation (the duration is 0 in case of local state allocation); then, with ar = (th, tk):

    durrj^rd(Crj) = Dr^rd (Tkj − Crj)
    durrj^wr(Crj) = Dr^wr (Thj − Crj)

where Dr^rd, Dr^wr are the queue reading/writing durations with remote memory allocation (the durations are 0 in case of local queue allocation). That is, if task ti is allocated to PE j, then durij(Mij, Stij) is the cumulative duration of the state reading, execution and state writing activities of ti; otherwise durij(Mij, Stij) = 0 (note that Stij, Mij can be 1 only if Tij is 1). By summing over j, as

done in (9), one gets the non-PE-dependent duration of a task. Functions durrj^rd(Crj) and durrj^wr(Crj) are the analogous of durij(Mij, Stij) for arc related activities.

2. Set based constraints. For each set of non mutually exclusive tasks τ = {ti0, ti1, . . .}, we post the following constraint for j = 0, . . . , np − 1:

    Σ_{ti ∈ τ} durij(Mij, Stij) + Σ_{ar=(−,ti)} durrj^rd(Crj) + Σ_{ar=(ti,−)} durrj^wr(Crj) ≤ deadline    (10)

where durij(Mij, Stij), durrj^rd(Crj) and durrj^wr(Crj) are the same as above. This second type of cut prevents the total duration of non mutually exclusive groups of activities on PE j from exceeding the deadline, since non mutually exclusive tasks cannot overlap and must execute in sequence if they run on the same PE. Note that both the number of paths and the number of sets of non mutually exclusive tasks are exponential: in order to deal with this, relaxation cuts are dynamically added during search; note that this is not a critical issue, as the subproblem relaxation is added for efficiency purposes and is not required for the approach to be sound. In particular, before each allocation step:

1. sets of tasks such that their cumulative worst case duration exceeds the deadline are identified;
2. constraints (9) and (10) are posted.

At the moment, the selection is done by solving a very simple constraint problem with random variable/value selection: the method could be improved by introducing a suitable heuristic to select more effective constraints at each iteration.
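To make the cut generation of this section concrete, the following sketch builds, for each PE j, the coefficient row of the cut Σ_{ti ∈ τ} Tij − Σ_{ti ∉ τ} Tij ≤ |τ| − 1 over the nt × np binary allocation variables Tij (the row-major variable indexing is an assumption of this illustration, not a detail from the actual ILP model):

#include <vector>

struct LinearCut { std::vector<int> coeff; int rhs; };  // coeff · T ≤ rhs

std::vector<LinearCut> bendersCuts(const std::vector<bool> &inTau,
                                   int nt, int np) {
    int tauSize = 0;
    for (bool b : inTau) tauSize += b;      // |τ|
    std::vector<LinearCut> cuts;
    for (int j = 0; j < np; ++j) {          // one cut per processing element
        LinearCut cut{std::vector<int>(nt * np, 0), tauSize - 1};
        for (int i = 0; i < nt; ++i)        // +1 for tasks in τ, -1 otherwise
            cut.coeff[i * np + j] = inTau[i] ? +1 : -1;
        cuts.push_back(cut);
    }
    return cuts;
}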

8 Experimental Results

We have performed three kinds of experiments, namely (i) evaluation of the computational efficiency of our optimization algorithms, (ii) comparison of the simulated throughput with optimizer-derived values, and (iii) proof of viability of the proposed approach on real-life demonstrators (GSM, Software Radio).

8.1 Instance generation

We tested the method on two sets of instances¹: the first set contains 178 graphs from a similar problem with no conditional edges; the graphs were randomly generated, then wrapped in a synthetic benchmark and finally annotated with computation times and branch probabilities via repeated simulation. Instances of this first group are only slightly structured, i.e. they have very short tracks and quite often contain singleton nodes: therefore we decided to generate a second group of 40 instances, completely structured (one head, one tail), featuring many precedence relations and branch/fork nodes with variable fan-out (2 to 4). For this purpose a novel generator was devised, with the ability of building realistic, CFU-compliant graphs and handling dependencies between node attributes². Durations and branch probabilities are in this case randomly generated, based on data available from the first set of instances.

¹ All instances are available at http://www.lia.deis.unibo.it/Staff/MicheleLombardi
² A description of the generator is available at http://www.lia.deis.unibo.it/Staff/MicheleLombardi

A single deadline constraint has to be met by the makespan of the final schedule and the release time of all tasks is 0; this is quite common in the embedded system literature (see [53, 9]). The deadline value was computed for each instance by properly choosing and stretching a lower bound on the makespan. The deadline value for the first group of instances is set to:

    dl = (1/υ) · (Σi DUR(ti)) / np

where DUR(ti) denotes the sum of the durations of all activities related to ti, np is the number of processing elements and υ ∈ (0, 1] is a utilization factor, which was set in this case to 0.85. The rationale for the above formula is that the small number of precedence relations makes the allocation and scheduling problem quite close to bin packing, for which a lower bound is given by Σi DUR(ti)/np.

A different formula was used to compute the deadline for the second group of instances; due to the presence of a single head task, a single tail and many precedence relations, the duration of the longest path is a good lower bound on the makespan; let ts be the source task and tt be the tail, then:

    dl = µ · max_{π = {ts, ..., tk, ..., tt}} Σ_{ti ∈ π} DUR(ti)

where π is a path from ts to tt and µ ≥ 1 is a multiplier, which was set to 1.25 for the instances at hand. Observe that the number of processing elements is not taken into account: in fact, each graph is mapped to a generated platform containing a number of PEs roughly proportional (by a constant factor) to the number of tasks in the graph; hence the number of PEs is implicitly taken into account by the µ coefficient. Both υ and µ were chosen to produce a reasonable number of infeasible instances in the test sets.
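The two deadline formulas are simple to compute once the durations are known; a sketch, assuming the DUR(ti) values and, for the second group, the duration lists of all source-to-tail paths are available:

#include <algorithm>
#include <vector>

// First group: dl = (1/υ) · Σ DUR(ti) / np
long deadlineGroup1(const std::vector<long> &dur, int np,
                    double upsilon = 0.85) {
    long total = 0;
    for (long d : dur) total += d;
    return (long)((1.0 / upsilon) * (double)total / np);
}

// Second group: dl = µ · (duration of the longest path)
long deadlineGroup2(const std::vector<std::vector<long>> &pathDurs,
                    double mu = 1.25) {
    long longest = 0;
    for (const auto &path : pathDurs) {
        long len = 0;
        for (long d : path) len += d;
        longest = std::max(longest, len);
    }
    return (long)(mu * (double)longest);
}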

8.2 Computational Efficiency

We implemented all the exposed algorithms in C++, using the state-of-the-art solvers ILOG Cplex 9.0 (for ILP) and ILOG Solver 6.0 and Scheduler 6.0 (for CP). We tested all instances on a Pentium IV 2GHz with 512MB of RAM. The time limit for the solution process was 20 minutes.

The results of the experiments on the first group are summarized in table 1. Instances are grouped by tens according to the number of activities (acts); besides this, the table also reports the number of nodes (NN), arcs (NA) and processing elements (PEs); statistics about the solution time follow: in particular, the mean time to analyze the graph (init) is given, together with the times to solve the master and the subproblem, to generate the no-good cuts and the mean overall time (in seconds); the mean number of iterations (it) is also reported. The solution times are of the same order as in the deterministic case (scheduling of Task Graphs), which is a very good result, since we are working on conditional task graphs and thus dealing with a stochastic problem.



The last two columns report the number of infeasible instances for each group (inf) and the number of reported timeouts, split into two classes depending on whether the hardest subproblem was the allocation (A) or computing a schedule (S). The solution time of these instances is not taken into account in the mean. The second column (UP/UB) reports the average values of two resource usage indicators; in particular, a PE usage indicator (UP) is computed for solved instances by summing up all task durations over the solution makespan and normalizing by the number of processing elements; more formally:

    UP = (1 / (np · mkspan)) · Σi DUR(ti)

where DUR(ti) is the duration of task ti and mkspan is the solution makespan. The UP indicator gives an idea of the stringency of the conditional unary constraints; note that mutual exclusions are not taken into account in the computation: hence the UP value can be (and usually is) higher than 1. Similarly, a bus usage indicator (UB) is computed as the average bus usage over the solution makespan, weighted by each activity duration and normalized over the total bus bandwidth:

    UB = Σi (req(acti) · DUR(acti)) / (mkspan · bandwidth)

where DUR(acti) is the duration of activity acti (execution, read/write state, read/write queues), req(acti) is the corresponding bus usage and bandwidth is the bus bandwidth. Again, mutual exclusions are not taken into account, making UB an indicator of the global tightness of the conditional timetable constraint.

acts   UP/UB      NN/NA        PEs  init   master  sub     nogd   time    it     inf  A/S
10-14  1.88/0.01  4-8/2-3      2    0.004  0.007   0.001   0.001  0.013   1.100  1    0/0
14-17  1.94/0.02  6-10/3-4     2    0.006  0.009   0.002   0.003  0.020   1.200  1    0/0
17-21  1.80/0.03  7-12/4-5     2-3  0.008  0.010   0.005   0.000  0.023   1.000  0    0/0
21-23  1.83/0.03  7-11/5-6     2-3  0.009  0.026   0.007   0.000  0.041   1.000  1    0/0
23-24  1.82/0.03  7-12/5-7     2-3  0.009  0.040   0.010   0.011  0.070   1.400  3    0/0
24-26  1.72/0.04  8-13/6-7     2-3  0.014  0.049   0.008   0.000  0.070   1.000  2    0/0
26-27  1.44/0.04  8-14/6-7     2-3  0.016  0.073   0.006   0.012  0.107   1.400  1    0/0
27-30  1.61/0.06  9-12/7-9     2-3  0.013  0.063   0.043   0.054  0.173   2.400  3    0/0
30-33  1.68/0.04  9-14/8-11    3-4  0.010  0.088   14.884  0.004  14.985  1.200  2    0/0
34-36  1.60/0.06  10-15/8-12   3-4  0.015  0.074   0.009   0.000  0.098   1.000  2    0/0
36-39  1.33/0.07  11-15/10-13  3-4  0.013  0.263   0.071   0.020  0.367   1.200  3    0/0
40-45  1.35/0.06  11-18/10-15  3-4  0.026  3.139   0.067   0.267  3.499   7.800  3    0/0
45-51  1.29/0.09  12-19/13-15  3-4  0.020  0.808   1.031   0.036  1.895   1.700  2    0/0
51-55  1.20/0.07  15-21/15     4-5  0.086  0.792   0.421   0.081  1.380   1.500  0    0/0
55-57  1.17/0.07  17-22/15     4-5  0.087  44.579  38.939  0.047  83.651  1.800  0    0/1
57-60  1.17/0.10  19-28/15     4-5  0.576  1.108   41.154  0.034  42.872  1.400  0    1/0
60-68  1.14/0.07  24-36/15     5-6  0.447  70.737  0.017   0.000  71.200  1.000  0    2/2
70-80  1.10/0.05  37-50/15     6    1.550  0.675   0.025   0.000  2.250   1.000  0    2/3

Table 1 Results of the tests on the first group of instances (slightly structured)
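For reference, the two indicators can be computed from a solution as follows (a sketch; durations, requirements, makespan and bandwidth are assumed to be available from the computed schedule):

#include <vector>

// UP = Σ DUR(ti) / (np · mkspan); may exceed 1, since mutual exclusions
// between tasks are deliberately ignored.
double UP(const std::vector<long> &taskDur, int np, long mkspan) {
    long total = 0;
    for (long d : taskDur) total += d;
    return (double)total / ((double)np * (double)mkspan);
}

// UB = Σ req(acti) · DUR(acti) / (mkspan · bandwidth)
double UB(const std::vector<long> &actDur, const std::vector<long> &actReq,
          long mkspan, long bandwidth) {
    double sum = 0.0;
    for (size_t i = 0; i < actDur.size(); ++i)
        sum += (double)actReq[i] * (double)actDur[i];
    return sum / ((double)mkspan * (double)bandwidth);
}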

The worst case complexity of the allocation problem grows with the number of nodes and arcs; as for the scheduling, note that, although all the reading/writing activities related to a single task are tightly connected, the order in which they must be performed


has to be decided at search time: hence the complexity of the scheduling subproblem is better expressed in terms of the number of activities. Note that in most cases computing an allocation is harder than solving the scheduling problem: this was expected, as the master problem requires finding an optimal solution, whereas hitting a feasible one is sufficient in the subproblem. Moreover, scheduling a set of activities with fixed durations and resource constraints is a very well known problem in CP, for which widely used and very efficient solution methods exist.

Perhaps counter-intuitively, the scheduling problem appears to be harder for the slightly structured instances (in Table 1). In this case the very small number of precedence relations makes it hard for the CP solver to evaluate the makespan stretch due to not yet ordered activities, hindering the computation of good bounds. In such a situation, an early scheduling decision preventing the satisfaction of the deadline constraint is detected only late in the search, leading to a solution time blowup; this could be solved by adding redundant propagation to improve the makespan bound. Alternatively, the impact of early bad choices can be probabilistically reduced by adopting a randomized goal and performing frequent restarts [6]; we therefore randomized the "schedule or postpone" strategy used in the CP solver, and tried to solve with restarts all the timed out instances dominated by scheduling: in all but one case the problem was solved within the time limit.

Note that the number of iterations is quite low: this is due to a positive interaction between the two problem components. Since the presence of a subproblem relaxation prevents the allocator from packing all tasks on a single PE, the bus traffic is reduced by locally allocating as many memory requirements as possible: this also reduces task durations, so that the produced allocations have a good chance to yield low makespan schedules. Finally, in the computed schedules the processing elements seem to be the most stressed resources: this is a quite natural consequence of choosing the minimization of the bus traffic as objective function for the allocation subproblem. The very low UB values reflect the very scattered bus usage profile observed in the solutions, with sporadic peaks interleaved by low-utilization periods. The normalized bus usage profile of an instance with a relatively high bus usage is shown in Figure 11, exhibiting typical short peaks interleaved with much wider valleys; the time unit in the graph (on the x-axis) is one clock cycle.

The results on the second group of instances (completely structured) are reported in table 2. In this case the higher number of arcs (and thus of precedence constraints) improves the CP solver propagation and makes the scheduling problem much more stable: no timeout due to the scheduling problem is reported. On the other hand, the increased number of arcs makes the allocation more complex and the scheduling problem approximation less tight, thus increasing the number of iterations and their duration.

acts    UP/UB      NN/NA        PEs  init  master  sub   nogd  time    it    inf  A/S
21-41   1.20/0.03  6-10/7-14    2-3  0.01  0.26    0.03  0.06  0.36    9.1   2    0/0
41-59   0.86/0.04  10-15/14-19  3-4  0.01  0.34    0.08  0.15  0.57    3.9   0    0/0
60-79   0.55/0.04  14-20/20-27  4-5  0.01  4.94    0.05  0.16  5.16    9.3   1    1/0
81-102  0.51/0.05  20-25/28-35  5-6  0.04  107.81  3.52  2.49  113.85  17.7  0    0/0

Table 2 Results of the tests on the second group of instances (completely structured)

We also performed a set of tests to verify the effectiveness of the cuts we proposed in section 7.4 with respect to the basic cuts removing only the solution just found.


Fig. 11 A typical bus usage profile; clock cycles are on the x-axis, while normalized bus usage is on the y-axis

Table 3 reports results for a 34-activity instance repeatedly solved with decreasing deadline values, until the problem becomes infeasible. The number of iterations is greatly reduced. Also, although the mean time to generate a cut grows by a factor of ten, the overall solving time per instance is definitely better with the tighter cuts.

Mean time to generate a cut: basic case 0.0074; with relaxation based cuts (RBC) 0.0499.

          number of iterations    execution time
deadline  basic case  with RBC    basic case  with RBC    result
8557573   2           3           1.18        0.609       opt. found
625918    1           1           0.771       0.765       opt. found
590846    1           1           0.562       0.592       opt. found
473108    19          6           6.169       1.186       opt. found
464512    190         14          201.124     9.032       opt. found
454268    195         24          331.449     10.189      opt. found
444444    78          15          60.747      6.144       opt. found
433330    9           4           4.396       1.657       opt. found
430835    5           3           3.347       1.046       opt. found
430490    5           3           3.896       1.703       opt. found
427251    3           2           2.153       0.188       inf.

Table 3 Number of iterations without and with scheduling relaxation based cuts

Finally, to estimate the quality of the chosen objective function (the bus traffic expected value), we tested it against an optimal solver combined with a heuristic technique of deterministic reduction. The chosen heuristic simply optimizes the bus traffic for the scenario where each branch is assigned its most likely outcome (i.e. the most probable scenario); despite its simplicity, this is a relevant technique, since it is actually used in modern compilers [38]. We ran tests on three instances: we solved them with our method and with the heuristic one (obtaining two different allocations); then, for each execution scenario, the resulting bus traffic values were computed for both approaches and compared.
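The deterministic reduction used by the baseline can be sketched in a few lines: each branch condition is fixed to its most likely outcome, yielding the single most probable scenario for which the bus traffic is then optimized (probs[c][o] is assumed to hold the probability of outcome o of condition c; this is an illustration, not the compiler-grade implementation cited above):

#include <vector>

// Returns, for each condition, the index of its most probable outcome.
std::vector<int> mostLikelyScenario(
        const std::vector<std::vector<double>> &probs) {
    std::vector<int> choice;
    for (const auto &outs : probs) {
        int best = 0;
        for (int o = 1; o < (int)outs.size(); ++o)
            if (outs[o] > outs[best]) best = o;
        choice.push_back(best);
    }
    return choice;
}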


Note that, by definition, the solver using the heuristic reduction 1) computes the best possible allocation for the most likely scenario, and 2) allocates all tasks, but does not take into account the traffic produced by tasks not present in the most likely scenario; nevertheless, even those tasks are often given local memory allocation, as this increases the chance of meeting the deadline. On the contrary, the conditional solver takes into account all scenarios at the same time, weighted by their respective probability; note however that the produced conditional solutions have no guarantee of being optimal for any single scenario.

The results of the comparison are shown in table 4, where for each instance the mean, minimum and maximum quality improvement over the heuristic reduction method are reported. Note that on average our method always improves on the heuristic solution; it performs a little worse on a few scenarios (e.g. the most probable scenario) and considerably better in many other cases (see the column with the maximum improvements).

                                  quality improvement
instance  activities  scenarios   mean     min      max
1         53          10          4.72%    -0.88%   13.08%
2         57          10          2.59%    -0.11%   8.82%
3         54          24          12.65%   -0.72%   39.22%

Table 4 Comparison with heuristic deterministic reduction

8.3 Validation of optimizer solutions

We have deployed the virtual platform to implement the allocations and schedules generated by the optimizer, and we have measured the deviations of the simulated throughput from the predicted one for 30 problem instances. A synthetic benchmark has been used for this experiment, allowing us to change system and application parameters (local memory size, execution times, data size, etc.). We want to make sure that the modeling approximations are not so rough as to significantly impact the accuracy of the results with respect to real-life systems.

Fig. 12 Difference in execution time


Fig. 13 Probability for throughput differences

The results of the validation phase are reported in figure 12 and figure 13. Figure 12 shows the differences between the execution times predicted by the optimizer and those measured with the cycle accurate simulator. It can be noticed that the differences are marginal, and we point out that all the deadline constraints are satisfied. Figure 13 shows the probability distribution of the throughput differences between optimizer and simulator results. The average difference between measured and predicted values is 4.8%, with a standard deviation of 2.41. This confirms the high level of accuracy achieved by the developed optimization framework, thanks to the calibration of the system model parameters against functional timing-accurate simulation and to the control of the system working conditions. We also performed a further validation step on two real world applications (a GSM encoder and a SW radio), described in detail in [11].

9 Conclusions

We target the allocation and scheduling of conditional multi-task applications on top of distributed memory architectures with messaging support. We tackle the complexity of the problem by means of decomposition and no-good generation, and introduce a software library and API for reliable software deployment. Moreover, we propose an entire innovative framework to help programmers in software implementation, and deploy a virtual platform to validate the results of the development framework and to check the modelling assumptions of the optimizer, showing a very high level of accuracy. Our methodology can potentially contribute to the advance in the field of software optimization and development tools for highly integrated on-chip multiprocessors.

Future developments could include the introduction of memory overlaying (i.e. allocation of the same area to different, non-overlapping tasks) in the runtime support, to improve the utilization of memory devices; this requires modifications both in the allocation and in the scheduling model, in order to track memory usage over time. Note however that, as embedded system applications are often repetitive, overlaying cannot be used for state memory or for static computation data shared between different iterations. Platforms featuring heterogeneous processing elements (e.g. general purpose cores and DSP engines), hardware limitations on the mapping choices for specific tasks, as well as task specific deadlines, can be taken into account with only minor modifications to


the optimization phase; the impact of such modifications on the solver performance has to be investigated. Finally, other objective functions, such as the expected completion time, could be addressed to broaden the scope of the approach.

References 1. A. Bender, MILP based task mapping for heterogeneous multiprocessor systems. Proceedings of the European Design and Automation Conference, EURO-DAC ’96 - EUROVHDL ’96 : 197–, 1996. 2. A. Eremin and M. Wallace, Hybrid benders decomposition algorithms in constraint logic programming. Proceedings of the International Conference on Principles and Practice of Constraint Programming, CP 2001 : 1–15, 2001. 3. ARM Semiconductor, ARM11 MPCore Multiprocessor. Available at http://arm.convergencepromotions.com/catalog/753.htm. 4. ARM Semiconductor, Arm11 mpcore multiprocessor. Available at http://arm.convergencepromotions.com/catalog/753.htm. 5. C. Le Pape, D. Vergamini, V. Gosselin, Time-versus-capacity compromises in project scheduling. Proceedings of the Thirteenth Workshop of the U.K. Planning Special Interest Group : 19–, 1994. 6. C. Gomes, B. Selman, K. McAloon, and C. Tretkoff, Randomization in backtrack search: Exploiting heavy-tailed profiles for solving hard scheduling problems. Proceedings of the International Conference on AI Planning and Scheduling, AIPS 98 : 208–213, 1998. 7. Cradle Technologies, The multi-core DSP advantage for multimedia. Available at http://www.cradle.com/. 8. D. Pham et al., The design and implementation of a first-generation CELL processor. Proceedings of the International Solid State Circuits Conference, ISSCC ’05 : 45–49, 2005. 9. D. Shin, J. Kim, Power-aware scheduling of conditional task graphs in real-time multiprocessor systems. Proceedings of the International Symposium on Low Power Electronics and Design, ISLPED 2003 : 408–413, 2003. 10. D. Wu, B. Al-Hashimi, and P. Eles, Scheduling and mapping of conditional task graph for the synthesis of low power embedded systems. Computers and Digital Techniques, IEEE Proceedings, volume 150 (5) : 262–273, 2003. 11. E. Dolif and M. Lombardi and M. Ruggiero and M. Milano and L. Benini, CommunicationAware Stochastic Allocation and Scheduling Framework for Conditional Task Graphs in Multi-Processor Systems-on-Chip Proceedings of the International Conference on Embedded Software, EMSOFT ’07 : 56–, 2007. 12. E. S. Thorsteinsson, Branch-and-check: A hybrid framework integrating mixed integer programming and constraint logic programming. Proceedings of the International Conference on Principles and Practice of Constraint Programming, CP ’01 : 16–30, 2001. 13. G. Laporte and F. Louveaux, The integer l-shaped method for stochastic integer programs with complete recourse. Operations Research Letters : 133–142, 1993. 14. G. Martin, Overview of the MPSoC design challenge Proceedings of the 43rd annual conference on Design automation, DAC ’06, 274–279, 2006. 15. Intel Corporation Intel IXP2800 Network Processor Product Brief, 2002. Available at http://download.intel.com/design/network/ProdBrf/27905403.pdf. 16. J. Axelsson, Architecture synthesis and partitioning of real-time systems: A comparison of three heuristic search strategies. Proceedings of the International Conference Hardware/Software Codesign and System Synthesis, CODES ’97, 161–, 1997. 17. J. Benders, Partitioning procedures for solving mixed-variables programming problems. Numerische Mathematik, volume 4 : 238–252, 1962. 18. J. N. Hooker, A hybrid method for planning and scheduling. Constraints, volume 10 (4) : 385–401, 2005. 19. J. N. Hooker, Planning and Scheduling to Minimize Tardiness. Proceedings of the International Conference on Principles and Practice of Constraint Programming, CP2005 : 314–327, 2005. 20. J. R. 
Graham, Integrating parallel programming techniques into traditional computer science curricula ACM SIGCSE Bullettin, volume 39 (4) : 75–78, 2007. 21. J.N.Hooker and G.Ottosson, Logic-based Benders decomposition. Mathematical Programming volume 96 (1): 33-60, 2003.

22. John Goodacre and Andrew N. Sloss, Parallelism and the ARM Instruction Set Architecture. Computer, volume 38 (7): 42–50, 2005.
23. K. Kuchcinski, Embedded system synthesis by timing constraint solving. Proceedings of IEEE ISSS ’97: 50–57, 1997.
24. K. Kuchcinski, Constraints-driven scheduling and resource assignment. ACM Transactions on Design Automation of Electronic Systems, 2003.
25. K. Kuchcinski and C. Wolinski, Global Approach to Assignment and Scheduling of Complex Behaviors based on HCDG and Constraint Programming. Journal of Systems Architecture, volume 49 (12–15): 489–503, 2003.
26. K. S. Chatha and R. Vemuri, Hardware-software partitioning and pipelined scheduling of transformative applications. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, volume 10 (3): 193–208, 2002.
27. L. Benini, M. Lombardi, M. Milano, M. Ruggiero, A constraint programming approach for allocation and scheduling on the Cell Broadband Engine. Proceedings of the International Conference on Principles and Practice of Constraint Programming, CP 2008: 21–35, 2008.
28. L. Benini, D. Bertozzi, A. Guerri, and M. Milano, Allocation and scheduling for MPSoCs via decomposition and no-good generation. Proceedings of the International Conference on Principles and Practice of Constraint Programming, CP 2005: 107–121, 2005.
29. L. Benini, D. Bertozzi, A. Guerri, M. Milano, Allocation, scheduling and voltage scaling on energy aware MPSoCs. Proceedings of the International Conference on Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, CPAIOR 2006: 44–58, 2006.
30. L. Benini, M. Lombardi, M. Mantovani, M. Milano, M. Ruggiero, Multi-stage Benders decomposition for optimizing multicore architectures. Proceedings of the International Conference on Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, CPAIOR 2008: 36–50, 2008.
31. M. Lombardi and M. Milano, Stochastic Allocation and Scheduling for Conditional Task Graphs in MPSoCs. Proceedings of the International Conference on Principles and Practice of Constraint Programming, CP 2006: 299–313, 2006.
32. M. Horowitz et al., The Future of Wires. Proceedings of the IEEE, volume 89: 490–504, 2001.
33. M. Horowitz, Scaling, Power and the Future of CMOS. Proceedings of the 20th International Conference on VLSI Design, VLSID ’07: 7–, 2007.
34. M. Ruggiero, A. Guerri, D. Bertozzi, F. Poletti, and M. Milano, Communication-aware allocation and scheduling framework for stream-oriented multi-processor systems-on-chip. Proceedings of the conference on Design, automation and test in Europe, DATE ’06: 8–, 2006.
35. P. Baptiste, C. Le Pape, W. Nuijten, Constraint-Based Scheduling. Kluwer Academic Publishers, 2003.
36. P. Laborie, Complete MCS-Based Search: Application to Resource Constrained Project Scheduling. Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI 2005, volume 19: 181–, 2005.
37. P. Eles, K. Kuchcinski, Z. Peng, A. Doboli, and P. Pop, Scheduling of conditional process graphs for the synthesis of embedded systems. Proceedings of the conference on Design, automation and test in Europe, DATE ’98: 15–29, 1998.
38. P. Faraboschi, J. Fisher, and C. Young, Instruction scheduling for instruction level parallel processors. Proceedings of the IEEE, volume 89 (11): 1638–1659, 2001.
39. P. Francesco, P. Antonio, and P. Marchal, Flexible hardware/software support for message passing on a distributed shared memory architecture. Proceedings of the conference on Design, automation and test in Europe, DATE ’05: 736–741, 2005.
40. P. Laborie, Algorithms for propagating resource constraints in AI planning and scheduling: Existing approaches and new results. Artificial Intelligence, volume 143 (2): 151–188, 2003.
41. P. Palazzari, L. Baldini, and M. Coli, Synthesis of pipelined systems for the contemporaneous execution of periodic and aperiodic tasks with hard real-time constraints. Proceedings of the IEEE International Parallel & Distributed Processing Symposium, 2004.
42. R. Szymanek and K. Kuchcinski, A constructive algorithm for memory-aware task assignment and scheduling. Proceedings of the International Conference on Hardware-Software Codesign and System Synthesis, CODES ’01: 152–, 2001.
43. R. W. Brodersen, M. A. Horowitz, D. Markovic, B. Nikolic and V. Stojanovic, Methods for true power minimization. Proceedings of the 2002 IEEE/ACM International Conference on Computer-Aided Design, ICCAD ’02: 35–42, 2002.

44. RTEMS home page, at http://www.rtems.com.
45. S. Borkar, Thousand core chips: a technology perspective. Proceedings of the 44th annual conference on Design Automation, DAC ’07: 746–749, 2007.
46. S. Borkar, Design Challenges of Technology Scaling. IEEE Micro, volume 19 (4): 23–29, 1999.
47. S. Kodase, S. Wang, Z. Gu, and K. G. Shin, Improving scalability of task allocation and scheduling in large distributed real-time systems using shared buffers. Proceedings of the IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS ’03: 181–, 2003.
48. S. Medardoni, M. Ruggiero, D. Bertozzi, L. Benini, G. Strano and C. Pistritto, Interactive presentation: Capturing the interaction of the communication, memory and I/O subsystems in memory-centric industrial MPSoC platforms. Proceedings of the conference on Design, automation and test in Europe, DATE ’07: 665–, 2007.
49. S. Prakash and A. C. Parker, SOS: synthesis of application-specific heterogeneous multiprocessor systems. Journal of Parallel and Distributed Computing, volume 16 (4): 338–351, 1992.
50. T. Mudge, Power: A First-Class Architectural Design Constraint. Computer, volume 34 (4): 52–58, 2001.
51. V. Jain and I. E. Grossmann, Algorithms for hybrid MILP/CP models for a class of optimization problems. INFORMS Journal on Computing, volume 13 (4): 258–276, 2001.
52. W. Brunnbauer, T. Wild, J. Foag, N. Pazos, A constructive algorithm with look-ahead for mapping and scheduling of task graphs with conditional edges. Proceedings of the Euromicro Conference on Digital System Design, DSD 2003: 98–103, 2003.
53. Y. Xie, W. Wolf, Allocation and scheduling of conditional task graph in hardware/software co-synthesis. Proceedings of the conference on Design, automation and test in Europe, DATE 2001: 620–625, 2001.
54. AMD (Advanced Micro Devices), http://multicore.amd.com/us-en/AMD-Multi-Core.aspx.
55. CISCO Systems, http://www.cisco.com/en/US/products/ps5763/.
56. NEC, http://www.nec.co.jp/techrep/en/journal/g06/n03/060311.html.
57. ST Microelectronics, http://www.st.com/stonline/products/families/mobile/processors/processorsprod.htm.

Appendix A - SM matrix computation
This section describes how to compute the sequence matrix SM of a given Conditional Task Graph; the main idea is that, since the input CTG is acyclic, the computation can be performed in a recursive fashion. More specifically, we can build the matrix incrementally by visiting the graph in topological order, provided we: a) know what SM is at the root node, and b) know how to update the sequence matrix at a node ti in terms of SM for all its parent nodes tj. In the following, we refer to SM(ti) as the state of SM in the incremental computation after node ti has been visited; similarly, we denote with SMhk(ti) the value of cell (h, k) of SM after ti has been visited. Since immediately after the root node is visited no outcome has been encountered yet, we have:

$$SM_{hk}(\text{root}) = 0 \qquad \forall\, h, k = 0 \dots n_o,\ h \neq k$$
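
As a concrete illustration of points "a" and "b", the following is a minimal sketch of the topological-order driver in Python. The graph interface (topological_order(), parents(t)) and the dense 0/1 matrix encoding of SM are assumptions of this sketch, not names from the paper; the per-node merge rule is the one detailed in the remainder of this appendix.

```python
import numpy as np

def compute_SM_states(ctg, n_o, merge_parent_states):
    """Build SM(t_i) for every node of an acyclic CTG in one topological
    visit. `ctg` is an assumed graph object exposing `topological_order()`
    and `parents(t)`; outcomes are indexed 0..n_o as in this appendix;
    `merge_parent_states` implements point "b" (the or-/and-node merge
    rules sketched later in this appendix).
    """
    SM = {}
    for t in ctg.topological_order():
        if not ctg.parents(t):
            # Point "a": right after the root no outcome has been
            # encountered, so SM_hk(root) = 0 for all h != k.
            SM[t] = np.zeros((n_o + 1, n_o + 1), dtype=int)
        else:
            # Point "b": combine the SM states of the parent nodes.
            SM[t] = merge_parent_states(ctg, t, SM)
    return SM
```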

Therefore, in order to provide a method to compute the final SM matrix, we just have to specify, for each node ti with parents t′, t′′, ..., how to compute SM(ti) in terms of SM(t′), SM(t′′), ... (see point “b” in the enumeration above). For this purpose it is useful to introduce the notions of activation set of an arc and of sequence matrix state of an arc during the computation (SM(ar), corresponding to SM(ti)).


Activation set of an arc. We define the activation set of an arc ar = (ti, tj), denoted AS(ar), as the activation set of the source node ti, possibly augmented with the outcome labeling the arc; more formally:

$$AS(a_r = (t_i, t_j)) = \begin{cases} AS(t_i) & \text{if } t_i \in T_F \ (t_i \text{ is a fork node}) \\ AS(t_i) \cup \{Out_{ij}\} & \text{if } t_i \in T_B \ (t_i \text{ is a branch node}) \end{cases}$$

Obviously the computation of AS(ar = (ti, tj)) runs in O(no) time.

Sequence matrix state of an arc. Given SM(ti) (the sequence matrix state after the visit of ti), SM(ar) with ar = (ti, tj) is defined as follows:

$$SM_{hk}(a_r) = SM_{kh}(a_r) = \begin{cases} 1 & \text{if } Out_h \in AS(t_i) \text{ and } Out_k \text{ labels } a_r \\ SM_{hk}(t_i) & \text{otherwise} \end{cases}$$

with 0 ≤ h < k ≤ no. In practice SM(ar) is the same as SM(ti) if ar is not conditional; conversely, if ar is labeled with Outk, a “1” has to be added in each cell pairing Outk with an outcome from AS(ti). Basically, no outcome in AS(ti) is by itself sufficient to set the state of arc ar to true, but they all need Outk (hence the update of SM, according to Definition 4). Note that the computation requires setting O(|AS(ti)|) cells to “1”.

Main step of the incremental computation. The main recursive step to compute SM(ti) depends on whether ti is an and-node or an or-node; the type (and/or) of a node ti can be checked by running A2 on all its parents: if they are pairwise mutually exclusive (i.e. the result of A2 is ∅), ti is an or-node; if no two of them are mutually exclusive, then ti is an and-node. Observe that, as the graph is acyclic and the execution of a node depends only on its predecessors, algorithm A2 does not require the full set of graph data structures; in particular, to compute the coexistence set of ti and tj one only requires SM(ti), SM(tj), EM, AS(ti) and AS(tj).

Case 1: Let ti be an or-node with parent nodes t′, t′′; then SM(ti) can be computed by merging the sequence matrices computed at t′ and t′′; more formally:

$$SM_{hk}(t_i) = SM_{kh}(t_i) = \max\{SM_{hk}(a'), SM_{hk}(a'')\}$$

where 0 ≤ h < k ≤ no, and SM(a′ = (t′, ti)) and SM(a′′ = (t′′, ti)) are the SM states corresponding to arcs (t′, ti) and (t′′, ti). The result generalizes to or-nodes with more than two parents. The rationale behind the operation is that, as the ingoing arcs of an or-node are mutually exclusive, no pair of outcomes coming one from AS((t′, ti)) and one from AS((t′′, ti)) is required for ti to execute. Therefore, according to Definition 4, no new cell has to be set to 1 in SM(ti). Note that the computation takes $O(n_o^2)$ time.


Case 2: If ti is an and-node, the result of merging t′ and t′′ has to be augmented by:
1. considering all possible sets of outcomes σ (resp. ρ) sufficient to set to true the state of arc (t′, ti) (resp. (t′′, ti));
2. setting SMhk = 1 for each pair of outcomes Outh, Outk (with h ≠ k) coming respectively from a σ and a ρ subset.
Here, due to Theorem 1, a set σ ⊆ AS((t′, ti)) sufficient to set to true the state of arc (t′, ti) is any set of outcomes Outh such that SMhk = 1 for every Outk ∈ C(σ), with h ≠ k; the notation C(σ) is a straightforward extension of C(AS(ti)) (from Theorem 1) to general sets. Note however that any outcome in AS((t′, ti)) is part of such a set σ and, similarly, any outcome in AS((t′′, ti)) is part of such a set ρ; therefore, we can simply set to 1 every SMhk cell such that Outh comes from AS((t′, ti)) and Outk comes from AS((t′′, ti)). Formally, let a′ be the arc (t′, ti) and a′′ = (t′′, ti):

$$SM_{hk}(t_i) = SM_{kh}(t_i) = \begin{cases} 1 & \text{if } Out_h \in AS(a') \text{ and } Out_k \in AS(a'') \\ \max\{SM_{hk}(a'), SM_{hk}(a'')\} & \text{otherwise} \end{cases}$$

with 0 ≤ h < k ≤ no. Again, the result generalizes to and-nodes with more than two parents. As some constant-time operations are performed for each SM cell, the computation takes $O(n_o^2)$ time.
Since 1) traversing the graph in topological order visits each node once, 2) all the described operations are performed a polynomially bounded number of times for each node, and 3) each operation has polynomial complexity, it follows that computing SM has polynomial complexity as well.
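
The arc-level definitions and the two merge rules above translate into a few small routines; the sketch below is a plausible Python rendering under the same assumptions as the previous fragment (outcomes as integer indices, SM states as 0/1 NumPy matrices), and all function names are ours. Composed inside the driver sketched earlier, these routines realize the full incremental computation.

```python
import numpy as np

def arc_activation_set(AS_src, label, src_is_branch):
    """AS(a) for an arc a = (t_src, t_dst): the activation set of the
    source node, plus the labeling outcome when the source is a branch
    node; runs in O(n_o)."""
    return set(AS_src) | {label} if src_is_branch else set(AS_src)

def arc_SM_state(SM_src, AS_src, label=None):
    """SM(a): equal to SM(t_src) for a non-conditional arc; for an arc
    labeled Out_k, a 1 is added between Out_k and every outcome already
    in AS(t_src) -- O(|AS(t_src)|) cell updates."""
    M = SM_src.copy()
    if label is not None:
        for h in AS_src:
            if h != label:
                M[h, label] = M[label, h] = 1
    return M

def merge_or_node(arc_states):
    """Case 1 (or-node): ingoing arcs are mutually exclusive, so the
    element-wise max of their SM states suffices; O(n_o^2)."""
    M = arc_states[0].copy()
    for A in arc_states[1:]:
        np.maximum(M, A, out=M)
    return M

def merge_and_node(arc_states, arc_ASs):
    """Case 2 (and-node): element-wise max, plus a 1 for every pair of
    distinct outcomes taken from the activation sets of two distinct
    ingoing arcs."""
    M = merge_or_node(arc_states)
    for i in range(len(arc_ASs)):
        for j in range(i + 1, len(arc_ASs)):
            for h in arc_ASs[i]:
                for k in arc_ASs[j]:
                    if h != k:
                        M[h, k] = M[k, h] = 1
    return M
```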

Appendix B - Proof of Theorem 1
The text of Theorem 1 is reported here for the reader's convenience:

Theorem 1 Let AS(ti) be the activation set of a task ti and let C(AS(ti)) be the set of conditions cj such that two or more outcomes of cj are in AS(ti). Then the activation event AE(ti) can be derived by:
1. Considering all consistent assignments of outcomes in AS(ti) to conditions in C(AS(ti)); an assignment of outcomes is consistent if it is part of at least one scenario. Let us refer to the assigned outcomes as Out.
2. Extending each Out with the outcomes Outh such that SMhk = 1 for every Outk ∈ Out. Formally:

$$Ext(Out) = Out \cup \{Out_h \in AS(t_i) \mid SM_{hk} = 1\ \forall Out_k \in Out\}$$

Then, it holds:

$$AE(t_i) = \bigvee_{\text{all consistent } Out} \left( \bigwedge_{Out_h \in Ext(Out)} Out_h \right) \qquad (11)$$
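
For illustration, step 2 of the theorem translates directly into a few lines of Python; the encoding (outcomes as integer indices, SM as a 0/1 matrix) and the availability of the consistent assignments from step 1 are assumptions of this sketch, and the function names are ours.

```python
def extend(Out, AS_ti, SM):
    """Ext(Out): Out plus every outcome Out_h of AS(t_i) such that
    SM[h, k] == 1 for every Out_k in Out (step 2 of Theorem 1)."""
    return set(Out) | {h for h in AS_ti
                       if h not in Out
                       and all(SM[h, k] == 1 for k in Out)}

def activation_event(consistent_assignments, AS_ti, SM):
    """AE(t_i) as in (11): one conjunction (represented here as a
    frozenset of outcomes) per consistent assignment Out of the
    conditions in C(AS(t_i))."""
    return {frozenset(extend(Out, AS_ti, SM))
            for Out in consistent_assignments}
```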

Since the activation event AE(ti) by definition contains all the sets of outcomes necessary and sufficient for the execution of ti, we just have to show that the formula defined by (11) has the same content; this is done in what follows.

(A) AE(ti) and (11) have the same size: First, recall that any two distinct conjunctions in AE(ti) must differ in some mutually exclusive outcomes of AS(ti). Note that this is the same as saying that they must differ in the assignment of some condition in C(AS(ti)). Hence |AE(ti)| is equal to the number of consistent assignments for the conditions in C(AS(ti)), where “consistent” means “appearing in at least one scenario”; moreover, the assignments Out serve as stems for the conjunctions in AE(ti).

(B) Each Out is completed to a conjunction in AE(ti): We therefore just have to prove that step 2 of Theorem 1 completes Out to a set of outcomes corresponding to a conjunction in AE(ti); equivalently, we can prove that, under the hypothesis that all outcomes in Out are true, the conjunction of the Outh ∈ Ext(Out) is necessary and sufficient for ti to execute. Formally:

$$\forall s \in S: \left( \bigwedge_{Out_k \in Out} Out_k \right) \Rightarrow \left( t_i \in TG(s) \Leftrightarrow \bigwedge_{Out_h \in Ext(Out)} Out_h \right)$$

(B1) Ext(Out) is sufficient: Let Outh be an outcome necessary for the execution of ti, given that Out occurs; this also means that Outh needs every outcome in Out for ti to execute, and hence SMhk = 1 for all Outk ∈ Out. Hence all the necessary outcomes Outh are in Ext(Out). (B2) Ext(Out) is necessary: Let G be the deterministic subgraph consisting of the (direct or indirect) predecessors of ti in TG(s) when Out is true in s. Note that all outcomes Outh ∈ Ext(Out) are in G, since none of them is mutually exclusive with any Outk ∈ Out. Observe that, since a deterministic graph contains and-nodes only, every outcome having a direct path to ti is needed for ti to execute. Now, since all outcomes Outh of Ext(Out) are in AS(ti), there exists a direct path in G from the arc labeled with Outh to ti. As a consequence, Ext(Out) contains only outcomes needed for ti to execute when Out occurs.

Appendix C - Soundness of A1
Algorithm A1 takes as input a set of condition outcomes σ and outputs its probability. Such an input set must satisfy some properties in order to be acceptable.

First requirement on the input set: The first requirement can be stated as follows:

Property 1 For each outcome Outij ∈ σ, the set must contain all the outcomes collected on the graph by traversing all the paths from a node t′ down to arc (ti, tj), where t′ is a node that always executes when the outcomes in σ are true. Note that different outcomes can refer to different starting nodes.

Such a set σ represents the execution of the subgraphs between all the starting nodes t′ and the final destination arcs. For example, with reference to Figure 2, the set σ1 = {¬a, b, ¬b} refers to the subgraph t1, (t1, t5), t5, (t5, t6), t6, (t6, t8), (t6, t9), with a single reference node (i.e. t1); the set σ2 = {¬a, d} refers instead to t1, (t1, t5), t12, (t12, t14), with two reference nodes (namely t1 and t12). Note that the execution of the reference nodes t′ is required for the whole subgraph to execute: hence, when speaking about the probability of σ, one actually refers to the conditional execution probability of the subgraph, given that all the starting nodes t′ execute, namely p(σ | t′); A1 is designed to compute such a conditional probability. With reference to the previous example, the probability of σ1 is p(σ1 | t1) = p(¬a) · (p(b) + p(¬b)) = 0.5 · (0.4 + 0.6) = 0.5, while the probability of σ2 is p(σ2 | t1, t12) = p(¬a) · p(d) = 0.5 · 0.3 = 0.15.

As a particular case, one can see that the activation set of a node ti always satisfies Property 1; moreover, in that case all outcomes refer to a single starting node (the CTG root). Hence, when σ = AS(ti), the probability of σ is the conditional execution probability of the subgraph ending in ti, given that the root node executes; as the root node always executes, we have p(AS(ti)) = p(ti | root) = p(ti).

Second requirement on the input set: Since basically any set of outcomes satisfies Property 1 with a proper choice of reference nodes, a second property is required for A1 to run correctly:

Property 2 If the outcomes in σ refer to several nodes t′, t′′, ... (in the sense of Property 1), then the nodes t′, t′′, ... must be non-mutually exclusive.

Once again, one can see that AS(ti) satisfies Property 2 (as there is a single reference node) and is therefore a valid input set for A1.

Proof of soundness: As A1 is a recursive algorithm, proving soundness requires proving that A) the algorithm computes the probability of σ in the base case, B) the algorithm splits σ into subsets still satisfying Properties 1 and 2, and C) the algorithm properly combines the probabilities of the split subsets. We start by assuming that all outcomes in σ refer to a single node t′ (as is the case for activation sets); the result will then be generalized. As a starting point, observe that:

Theorem 3 If all outcomes in σ refer to a single node t′, then either σ contains a single outcome (|σ| = 1), or there exists no Outh ∈ σ such that SMhk = 0 and EMhk = 0 for every other Outk ∈ σ.

Proof This is a direct consequence of Property 1: since all outcomes are collected along paths starting from a single node t′, every pair Outh, Outk is either collected on the same path (and is therefore needed for some arc state to be true, so that SMhk = 1) or collected after a branch node (so that EMhk = 1).

Theorem 3 immediately applies to activation sets (as they have a single reference node). More generally, however, an input set σ can always be thought of as a collection of subgraphs with different reference nodes: each subgraph has a single reference node and therefore complies with Theorem 3.

A) A1 computes the probability of σ in the base case: The base case for A1 is when there is no pair of outcomes Outh, Outk such that EMhk = 1. Due to Theorem 3, in such a situation σ is a collection of subgraphs (let those be σ0, σ1, ...) rooted at different reference nodes; for each of them we have |σi| = 1, or SMhk = 1 for every pair Outh, Outk ∈ σi. This corresponds to subgraphs with a single outcome, or where all outcomes are collected on a single path: the conditional execution probability in this case is $\prod_h p(Out_h)$, which is the value returned by the algorithm.

B) A1 splits σ into subsets satisfying Property 1 and Property 2: At each step A1 identifies a set B of outcomes originating at the same branch and partitions σ into:


a) $C = \bigcap_{Out_h \in B} (\sigma \cap SM_h)$
b) $R = \bigcap_{Out_h \in B} (\sigma \setminus SM_h)$
c) for each $Out_h \in B$: $X_h = ((\sigma \cap SM_h) \setminus C) \cup \{Out_h\}$

Consider any of the listed sets and let us refer to it as ρ; recall that it is always possible to choose a set of reference nodes t′, t′′, ... such that Property 1 is satisfied. Suppose now, per absurdum, that the set does not satisfy Property 2, i.e. that it contains two mutually exclusive reference nodes t′ and t′′; then ρ must contain two outcomes Outh, Outk whose reference nodes are t′ and t′′. Note that Outh, Outk must be mutually exclusive as well, with the mutual exclusion originating at some branch, say ti. Algorithm A1 visits branches in topological order; therefore ti either 1) has already been visited, meaning its outcomes have already been selected in a B set, or 2) is currently being visited, meaning its outcomes are part of the current B set. In case 2, Outh and Outk originate at the same branch; in this situation A1 puts them into different subsets of type “c”, contradicting the hypothesis. In case 1, there are some other outcomes between ti and Outh, Outk (let them be Out′h, Out′k), for which it must hold SMhh′ = SMkk′ = 1; in this situation A1 would have put Outh and Outk into different subsets of type “c” at some earlier step, again contradicting the hypothesis. Therefore, all sets produced by algorithm A1 satisfy both Property 1 and Property 2.

C) A1 properly combines probabilities: All the subsets produced by algorithm A1 at each step correspond to (collections of) subgraphs; let σ be the initial input set, and let it have a single reference node (the graph root) according to Property 1: this is true both for activation sets and for coexistence sets.
1. Subsets of type “c” ($X_h = ((\sigma \cap SM_h) \setminus C) \cup \{Out_h\}$) represent mutually exclusive subgraphs having the same reference node as σ; in the following, let Xh be such a set and X the whole collection (X = {Xh | Outh ∈ B}). As the “c” sets are mutually exclusive, their probabilities can be summed.
2. The “a” subset ($C = \bigcap_{Out_h \in B} (\sigma \cap SM_h)$) represents subgraphs containing outcomes required for the execution of the original subgraph σ and depending on the subgraphs of type “c”.
3. The “b” subset ($R = \bigcap_{Out_h \in B} (\sigma \setminus SM_h)$) represents subgraphs independent (in the probabilistic sense) of the outcomes Outh ∈ B.

Therefore, the probability of the original set σ is:

$$\begin{aligned}
p(\sigma) &= p(X \wedge C \wedge R) && (12)\\
&= p(X \wedge C)\, p(R) && (13)\\
&= p(X)\, p(C \mid X)\, p(R) && (14)\\
&= \sum_{Out_h \in B} p(X_h \mid root)\; p(C \mid X)\; p(R \mid root) && (15)
\end{aligned}$$

Here step (12) simply expresses the probability of σ in terms of the described decomposition; at step (13), p(R) is detached (as R is probabilistically independent of X); at step (14), Bayes' theorem is applied to p(X ∧ C); and at step (15), p(X) is expressed as a sum of p(Xh | root), with p(Xh | root) = p(Xh). One can see that expression (15) is computed by A1 at lines 11-14; all the conditional probabilities appearing in it can be computed with A1.
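
To make the derivation concrete, the following sketch reproduces the base case and the combination step (15) in Python and checks them against the σ1, σ2 example given at the beginning of this appendix. The function names are ours, and the explicit decomposition of σ1 (X = {{b}, {¬b}}, C = {¬a}, R = ∅, with p(C | X) = p(¬a) since the two conditions are independent) is an illustrative assumption, not A1's internal trace.

```python
from functools import reduce
from operator import mul

def a1_base_case(sigma, p):
    """Base case: no pair of outcomes in `sigma` is mutually exclusive
    (all EM_hk = 0), so the conditional execution probability is the
    product of the individual outcome probabilities."""
    return reduce(mul, (p[out] for out in sigma), 1.0)

def a1_combine(p_X_list, p_C_given_X, p_R):
    """Recursive step, expression (15): probabilities of the mutually
    exclusive type-"c" subgraphs X_h are summed, then scaled by the
    probabilities of the dependent part C and the independent part R."""
    return sum(p_X_list) * p_C_given_X * p_R

# Worked check on the sigma_1 / sigma_2 example of this appendix.
p = {"not_a": 0.5, "b": 0.4, "not_b": 0.6, "d": 0.3}
# sigma_1 = {not a, b, not b}: B = {b, not b}; X = {b}, {not b}; C = {not a}; R = {}.
p_sigma1 = a1_combine([p["b"], p["not_b"]], p_C_given_X=p["not_a"], p_R=1.0)
assert abs(p_sigma1 - 0.5) < 1e-9
# sigma_2 = {not a, d}: no mutually exclusive pair, so the base case applies.
p_sigma2 = a1_base_case(["not_a", "d"], p)
assert abs(p_sigma2 - 0.15) < 1e-9
```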


Appendix D - The coexistence set is a valid input set for A1
A set of outcomes must satisfy Properties 1 and 2 to be an acceptable input for A1. We thus have to show that the output of A2 admits a set of pairwise non-mutually-exclusive reference nodes, in the sense of Property 1. Note that CS(ti, tj) is built by iteratively adding sets σ and AS(tj) \ D to the (initially empty) coexistence set (line 12). Sets σ are sufficient for ti to execute and compatible with tj; sets AS(tj) \ D are sufficient for tj to execute and compatible with ti; note that the union of σ and AS(tj) \ D retains both properties. Therefore:
1. outcomes in the sets σ must have non-mutually-exclusive reference nodes (or they would not let ti execute, according to Property 1);
2. outcomes in the sets AS(tj) \ D must have non-mutually-exclusive reference nodes (or they would not let tj execute);
3. reference nodes from the sets σ and AS(tj) \ D must not be mutually exclusive, or they would not let ti and tj execute at the same time.
Therefore the reference nodes for the outcomes in CS(ti, tj) are pairwise non-mutually exclusive, and the set is an acceptable input for A1.
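
As a closing illustration, the Property 2 check that makes a set of outcomes acceptable for A1 is straightforward to express; in the sketch below the mutual-exclusion predicate is assumed to be available (e.g. derived from the EM matrix or from algorithm A2), and the function name is ours.

```python
from itertools import combinations

def satisfies_property_2(reference_nodes, mutually_exclusive):
    """Property 2: the reference nodes t', t'', ... of an A1 input set
    must be pairwise non-mutually exclusive. `mutually_exclusive(u, v)`
    is an assumed predicate on pairs of nodes."""
    return all(not mutually_exclusive(u, v)
               for u, v in combinations(reference_nodes, 2))
```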