Data-Flow Hard Real-Time Programs: Scheduling Processors and Communication Channels in a Distributed Environment

Renzo Davoli

Fabio Tamburini

Technical Report UBLCS-99-7 April 99

Department of Computer Science University of Bologna Mura Anteo Zamboni 7 40127 Bologna (Italy)

The University of Bologna Department of Computer Science Research Technical Reports are available in gzipped PostScript format via anonymous FTP from the area ftp.cs.unibo.it:/pub/TR/UBLCS or via WWW at URL http://www.cs.unibo.it/. Plain-text abstracts organized by year are available in the directory ABSTRACTS. All local authors can be reached via e-mail at the address [email protected]. Questions and comments should be addressed to [email protected].


Data-Flow Hard Real-Time Programs: Scheduling Processors and Communication Channels in a Distributed Environment Renzo Davoli1

Fabio Tamburini2

Technical Report UBLCS-99-7
April 99

Abstract

This work concentrates on periodic Data-Flow programs with Hard Real-Time constraints. Such programs are represented by precedence DAGs, which express both the control dependencies and the communication flows among their nodes. A mathematical model for computing a feasible schedule is proposed, taking into account the exchange of data between nodes over a deterministic transmission network. The model is then transformed into an optimization problem, and a solution based on the Simulated Annealing technique is presented.

1. Dept. of Computer Science, University of Bologna, Piazza di Porta S.Donato 5, I-40127 Bologna, Italy. E-Mail: [email protected] 2. C.I.L.T.A., University of Bologna, Piazza S.Giovanni in Monte 4, I-40124 Bologna, Italy. E-Mail: [email protected]


1 Introduction

Coarse-Grain Data Flow (CGDF) has proved to be a well-suited paradigm for exploiting parallelism in parallel and distributed systems [1], especially for signal processing [8]. In this paradigm each program is represented by a directed acyclic graph (DAG) whose nodes represent the execution of sequential procedures with no synchronization or communication actions hidden inside, and whose arcs depict data exchange and/or synchronization actions.

In this paper we focus on the class of Hard Real-Time (HRT) systems: those which may cause catastrophic effects if a timing deadline is missed. Several applications, such as those related to signal processing, may need parallel execution to achieve a sufficient degree of performance to fulfil the timing requirements of the problem. At the same time, in an HRT environment the temporal behavior of each application must be deterministic, at least with regard to the computation of the worst-case response time. The system has to ensure that the worst-case execution terminates before the deadline.

The processing environment we assume is multiprocessor and loosely coupled; a distributed system is a well-known example of this sort of architecture. We also assume that the processors are interconnected by a time-division multiplexing (TDM) communication network: each processor is allowed to send a communication unit (packet) only during its own time slot, and time slots are assigned to processors in cyclic order. By avoiding collisions and contention, TDM ensures a deterministic behavior of the network. All processors and network links are assumed to be homogeneous: programs progress at the same speed on each processor in the system, and single-packet messages are delivered in the same time along all the network links.

We present a model to express the schedulability problem for a set of Data-Flow (DF) periodic programs.
We also provide an algorithm that computes a mapping function and the response time of each node of the CGDF program. This algorithm can be used as a schedulability checker and off-line scheduler in an HRT system. In systems tailored to HRT programs, an off-line scheduling analysis [20] is necessary to ensure the availability of processing resources when a critical situation arises. Dynamic on-line schedulers are not appropriate when dealing with HRT programs: given the finite amount of available processing resources, the system may not be able to honor the requests of a newly entering program. In an HRT environment this case is equivalent to a timing failure, so it may lead to dangerous situations.

The proposed method uses a simulated annealing procedure to find a suitable mapping and time assignments for nodes and communication events which satisfy all the constraints: processors and network links must never be overloaded, the program deadlines have to be respected, and the loads must be consistent. The method also defines a polynomial-complexity algorithm to compute the worst-case response time and the processor and network load caused by each program, given a mapping and a load assignment.

The paper is organized as follows: section 2 states the problem, describing the characteristics and attributes of the environment and the programs; section 3 compares this work with the current state of research on the same topic; sections 4 and 5 contain a detailed description of the method; and section 6 shows the effectiveness of the proposed method by presenting some simulation results.

2 Problem Statement

In the model we propose, programs are not atomic entities: they consist of nodes that may be executed concurrently, and possibly in parallel, by different processors. The data communications and precedence relations between the nodes of a program are expressed using the deterministic Coarse-Grain Data Flow paradigm. In the DF paradigm, programs are represented by acyclic graphs, whose arcs embody both data communication and timing precedence relationships and whose nodes are procedures with functional behavior (inputs are known at activation time, results are delivered at the end, and there is no communication in between). This paradigm


makes all the sequential and concurrent parts of the problem explicit, and this definition is therefore particularly applicable to HRT problems with fixed deadlines [15].

The results presented in this paper refer to a specific subset of DF programs: scatter-gather (SG) programs. These are created by the recursive application of two basic operators: a sequence operator and a concurrent-execution operator. The principal advantage of this restriction is the possibility of minimizing or totally eliminating interference between the different concurrent components. The related lack of generality, in our opinion, is not a serious limitation: the class of problems that can be described with such a model is far from trivial and represents a powerful platform to start from.

In our model the deadline of each activation coincides with the end of the period. Periodic programs may also have different periods and deadlines, but this case is less common in applications; in [4] we briefly explain how our method could be adapted to handle it. Although our scheduling approach is tailored to periodic programs, it may deal with sporadic and event-driven programs as well, given that the maximum signalling frequency of the triggering events (for event-driven programs) and the maximum response time (for both) are known a priori. Moreover, the minimum activation delay of each program must be greater than the time interval needed to perform its task (without exceeding its RT constraints). Under these conditions, in fact, it is possible to manage all programs as if they were periodic [14]; knowledge of the requested parameters is therefore necessary to limit the amount of resources to be allocated.

Checking the schedulability of a set of DF programs means not only providing a suitable mapping function but also evaluating how to distribute the response time of each program among its nodes and its communication events.
As for mapping, we regard it as a function having the logical set of nodes as its domain and the physical set of processors as its co-domain. From a theoretical point of view, the mapping problem consists of finding a partition of the nodes leading to a feasible schedule. This is a search problem over all subsets of the node set. Clearly, an optimal solution can be found only by performing an exhaustive search on this set, an intractable NP problem: the operation is in general isomorphic to the N-knapsacks problem, so it cannot be solved in polynomial time (unless P = NP). Moreover, changing the response-time assignment of a node affects the processor load for its execution; in the same way, changing the requested response time of a communication event modifies the network load. Therefore, even if the mapping had been fixed a priori, a suitable time assignment for both nodes and communication arcs of the DF program would still have to be computed.

In [3] we presented a deterministic algorithm to compute the time-assignment function for nodes, given a previously computed mapping between nodes and processors. The result was extended in [4], where we introduced the use of Simulated Annealing methods to perform the non-deterministic search. Neither of these papers takes the communication costs into account. Here we further extend those results by introducing a method for computing the mapping function and the time/load assignment for both nodes and communication events. The simulated annealing method has also been enhanced by the use of local optimization methods.

3 Related Work

Liu and Layland in their fundamental work [12] have shown that Earliest Deadline First (EDF) scheduling is optimal for a set of independent periodic processes on a single processor. Schedulability checking in such a system is carried out by verifying that the processor is not overloaded. Liu and Layland's result is here extended in three independent directions: from a single processor to a loosely coupled multiprocessor environment; from unrelated processes to a set of executable entities with data and precedence constraints; and from a method for computing processor loads and checking task schedulability to a method capable of computing both processor and network loads, in order to check both task and communication schedulability.

In a loosely coupled (private-memory) environment, data exchange between tasks running on different processors is possible, while process migration should be avoided. This assumption is consistent with common implementations: a process migration involves either the real-time


move of the task code or the availability of the code in all the processor memories and, in any case, the transfer of the process state. This operation would lead, in our opinion, to a waste of system resources and to a lower degree of time determinism; as a result, there could be a clash with HRT requirements. This characterization makes the classical results for shared-memory systems (e.g. [13]) inapplicable to our model. In particular, the non-optimality result of Mok [14] for EDF scheduling in multiprocessor (tightly coupled) environments does not apply to our model, because we run a set of independent EDF schedulers, one for each processor, with a pre-computed mapping, rather than a global EDF scheduler for all processors.

Several papers listed in the bibliography have similarities with our work or address different but related problems. Muntz and Coffman [15] presented a method for minimizing the schedule length of a single rooted-tree-structured program to be executed once (single shot, not periodically) in a multiprocessor environment. Gerber, Hong and Saksena [7] address the problem of feasibly scheduling periodic processes with precedence constraints on a single processor. Their work is specifically tailored to satisfy end-to-end timing parameters of an application when data has to be managed by processes having different periods. Zhu, Lewis et al. [23] also address a similar problem: in their model a real-time program is described by a graph having two arc sets, representing data and precedence relationships separately. Each process has its own period and the operating environment is uniprocessor.

Our work adds a new dimension to this topic of research: parallel execution. The model we present deals with a set of parallel periodic programs. In our opinion it is both essential and common for HRT programs to perform heavy CPU-bursting computations, and in such cases parallel execution is crucial.
Moreover, the model could work as a basis on which to design a modular general-purpose environment for (Hard) Real-Time applications. Sih and Lee [18, 19] and El-Rewini and Lewis [16] have already addressed the problem of scheduling graph-structured programs onto a set of processors. They compute the minimal schedule length of a set of tasks, given their precedence relationships, using a non-preemptive policy. Their method could verify the consistency of a periodic execution by checking whether the minimal schedule length is shorter than the (single) requested period, but their considerations cannot be directly applied to an environment where several graph-structured parallel programs with different periods are to be executed on a single parallel system. Our method addresses the latter problem: it searches for a feasible schedule that minimizes processor usage, whereas their search for the minimal schedule generally leads to a non-optimal resource allocation in which all the processors are used more than necessary to respect the timing constraints. The scheduling policy is also different: we adopt a variable-priority preemptive paradigm while they use a fixed-priority scheme.

Tindell, Burns and Wellings [21] address the problem of scheduling a set of tasks onto a parallel computer taking into account the data dependencies between them; their scheduling paradigm is fixed priority. They do not consider graph-structured periodic programs: in their model each task has its own period. This leads to the same class of synchronization problems in computing the end-to-end timing of an application as discussed in the already cited papers by Gerber et al. [7] and Zhu, Lewis et al. [23].

Off-line scheduling is necessary to pre-allocate resources before starting the critical process control, in order to give real guarantees. This approach is common in telecommunication models when Hard Real-Time requirements are needed.
For example, pre-allocation methods are used to reserve bandwidth or virtual circuits on communication links when a requested quality of service (QoS) must be guaranteed. In this sense, our research has similarities with projects such as Tenet (Berkeley University) [6].
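For reference, the single-processor result of Liu and Layland cited above reduces to a one-line utilization check. A minimal sketch, assuming deadlines equal to periods (the task-tuple format is ours, not the paper's):

```python
def edf_feasible(tasks):
    """Liu-Layland EDF test for one processor: a set of independent
    periodic tasks (worst_case_cost, period) with deadline == period
    is schedulable iff the total utilization does not exceed 1."""
    return sum(c / d for c, d in tasks) <= 1.0
```

For example, three tasks with utilizations 0.25, 0.25 and 0.5 exactly saturate the processor and are still feasible; any further task tips the test over.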

4 A Formal View

The basic problem addressed here is to obtain a sufficiency condition that allows the execution of a set of unrelated programs JSET = {J1, ..., J_NPGMS} having HRT requirements, on a set of


processors PSET = {P1, ..., P_NPROC} in a safe way. The period of a program Ji is Di. Each activation of Ji has to be completed within Di time units; moreover, Di is also the minimum time between two sequential activations of Ji.

All programs Ji consist of atomic execution entities named nodes. Different nodes, even if they belong to the same program, can be executed on different processors (and they generally are, when possible, in order to achieve parallel execution). Let NSET = {n1, ..., n_NNODE} denote the set of all the node identifiers (of all programs). Each identifier is unique, i.e. it appears once in the whole set of programs. Let ck be the computational cost (weight) of node nk: ck represents the time needed for node nk to execute on a single dedicated processor in the worst case of its algorithm. The effective computation time of nk clearly depends on the input data; it generally changes between instances and is non-negative (i.e. it may be zero), but in no case can it exceed ck (on a single dedicated processor). When several nodes are mapped on the same processor, ck represents the amount of processor time node nk needs to complete its computation; the response time of nk must also take into account the delays due to the execution of the other concurrent nodes. Processors are assumed to be homogeneous, so the computational costs are independent of the node mapping.

Let ASET = {a1, ..., a_NARC} denote the set of all the arcs involved in the programs, i.e. the set of all communication events. Let Netc_h be the communication cost (weight) of arc ah: it represents the maximum amount of data that node source(ah) sends to node dest(ah). As for the computational cost of nodes, Netc_h must be evaluated in the algorithm's worst case.

All the programs Ji are Scatter-Gather (SG) programs, i.e.
the structure of their control and data dependency graphs is not general but can be built by the recursive application of two basic operations, sequence and concurrent execution. More formally, an SG program can be defined recursively as follows:
- a single node nk in NSET is an SG program;
- the sequential concatenation of two SG programs is an SG program;
- the concurrent execution of two SG programs is an SG program.

SG programs can be expressed in a natural way as Regular Expressions (without the star operator). Concurrent execution is a binary operator '|', while sequence is indicated by an implicit operator, i.e. sequential elements are simply juxtaposed. Round parentheses are used for grouping.

Our schedulability problem can be described as a search (i) to build a mapping function between all nodes and the set of available processors and (ii) to assign a requested response time rk (or, equivalently, a quota of processing power λk = ck/rk) to each node in the system and a requested response time Netr_h to each communication event, such that:

1. The response times rk (or load specifications λk) and Netr_h are positive.

2. The global response time of each program Ji must be less than the period Di.
3. No processor Pj is overloaded.
4. The communication network is not overloaded.
5. Each node is assigned to exactly one processor.

The global response time of a program Ji is composed of two terms: computation time and communication time. The computation time of Ji can be computed by a recursive evaluation (that follows the recursive definition of SG programs): add the response times of sequentially executed sub-SG programs and take the maximum between concurrent ones. Processor load can be computed in a symmetrical way: the load of concurrently executed sub-SG programs is the sum of the loads of all the subprograms, while the load of sequential components is the maximum load among them, since sequential components cannot contend for the processor. Figure 1 illustrates the method.

The global response time could also easily be computed using a different method which works for any kind of DAG-based data-flow program: following a topological sort of the nodes, the response time of a program from its beginning up to each node is the sum of that node's response time and the maximum among the response times up to each preceding node. However, while the processor loads of SG programs can be computed with a polynomial-complexity algorithm as shown above, it can be proved (see Appendix A) that computing the processor load





    Time(nk)  = ck/λk                     Load(nk)  = λk if μk = p, 0 otherwise
    Time(A|B) = max{Time(A), Time(B)}     Load(A|B) = Load(A) + Load(B)
    Time(AB)  = Time(A) + Time(B)         Load(AB)  = max{Load(A), Load(B)}

Figure 1. Load and Time functions for SG programs.

for a DAG-based program is an NP-complete problem.

The evaluation of the communication costs involves both the development of a method to express the precedence relationships among the communication events and the presentation of a realistic protocol that guarantees such a precedence pattern. The basic idea used to unify the computation of the processor and network schedules is to consider each shareable network resource as if it were a processor: just as nodes have to be mapped onto processors, communication events have to be mapped onto "networks". In the former case, the method needs to check that processors are not overloaded, whereas in the latter case the global bandwidth requested of any network must never exceed the global bandwidth of that network. If we ignore for the moment the contention on the medium (the problem will be dealt with below), a bus-based diffusion network like Ethernet could be seen as a single network resource, a network processor. Ethernet is a well-known network environment, but its timing behavior can be estimated a priori only in a probabilistic way; Ethernet (on its own) is therefore not a suitable Data-Link protocol for hard real-time applications.

Using TDM methods to avoid collisions and contention on the medium, each processor can be seen as having a corresponding independent network resource, shareable by all the processes mapped on that processor, i.e. its own network processor. In fact, each processor can use the bandwidth slice assigned to it by TDM with no interference from the other processors. Apart from a constant delay needed to access the time slot, we can consider a TDM network as if it were composed of several independent networks, each one having 1/n-th of the bandwidth and a single transmission processor. In this scenario, network processors can be implemented in software or in hardware; they can be built on or beside the corresponding processor as shown in Fig. 2. Each network processor fills the TDM slot when there is data to be sent.
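Under the TDM assumption, worst-case network timing can be bounded by simple arithmetic. A minimal sketch (function and parameter names are ours), assuming one packet fits in one slot and that, in the worst case, the sender has just missed its own slot:

```python
def tdm_worst_delivery(npkts, nproc, slot):
    """Worst-case time to send npkts single-slot packets from one
    processor on a TDM network with nproc cyclically assigned slots.
    Worst case: the sender just missed its slot, so it first waits
    (nproc - 1) * slot; it then delivers one packet per cycle of
    length nproc * slot, each completing at the end of its own slot."""
    if npkts == 0:
        return 0
    initial_wait = (nproc - 1) * slot
    return initial_wait + slot + (npkts - 1) * nproc * slot
```

This matches the "several independent networks" view above: each processor effectively owns 1/nproc of the bandwidth, plus a constant slot-access delay.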
Network processors can also enforce priorities on the communication events by enqueueing the packets to be sent in an appropriate way. Thus, in this scenario, processor scheduling techniques like EDF can be applied to the network as well. Let us name the set of network processors NetPSET = {NetP1, ..., NetP_NPROC}.
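A network processor that enqueues packets by deadline, as suggested above, can be sketched with a priority queue (the class and method names are hypothetical):

```python
import heapq

class NetworkProcessor:
    """Sketch: packets are enqueued with a deadline and the most
    urgent one is transmitted in each TDM slot owned by this processor."""

    def __init__(self):
        self._queue = []   # min-heap ordered by deadline
        self._seq = 0      # tie-breaker so payloads are never compared

    def enqueue(self, deadline, packet):
        heapq.heappush(self._queue, (deadline, self._seq, packet))
        self._seq += 1

    def on_own_slot(self):
        """Called once per TDM cycle; returns the packet to transmit,
        or None when the queue is empty (the slot stays unused)."""
        if self._queue:
            return heapq.heappop(self._queue)[2]
        return None
```

Popping always by earliest deadline is exactly the EDF policy transferred to the network side.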


[Figure: NPROC computation units, each composed of a computation processor (CP) and a network processor (NP), connected by a deterministic network.]

Figure 2. Computation system model.

Figure 3. On the left, a program precedence graph; on the right, the corresponding communication graph. In the middle, the two graphs are superimposed to show the correspondence.

It may appear, at first sight, that the problem of defining the precedences among the communication events can be solved by building a communication graph whose nodes (represented as small squares) correspond to the arcs of the program precedence graph and whose arcs are inherited from the precedence relationships of the original graph. Unfortunately, this method does not transform SG programs into SG-structured graphs. Fig. 3 shows a counter-example: an SG graph having a non-SG communication graph. Thus, as already discussed, the computation of the load for general graphs would in general lead to an intractable NP-complete problem. The method for computing the network load presented in this paper eliminates the NP-completeness: it defines and enforces a priority schema on message enqueueing consistent with a precedence graph on communication events which is SG.

The procedure to compute the SG communication graph is illustrated in Fig. 4. Given an SG program P, we define the communication graph Com(P) using the following recursive process:
- if P is composed of a single node, the comm. graph is an empty graph with one input and one output directly joined;
- if P is composed of two concurrent SG programs A and B, i.e. P = A|B, the comm. graph Com(P) is the union of Com(A) and Com(B); it has all the inputs of Com(A) and Com(B) and all the outputs of Com(A) and Com(B);
- if P is composed of a sequence of two SG programs A and B, i.e. P = AB, the comm. graph Com(P) is computed in four different ways depending on the number


of outputs of Com(A) and the number of inputs of Com(B). In any case the set of inputs of the resulting comm. graph coincides with that of Com(A), whereas the set of outputs coincides with that of Com(B):
  - if Com(A) has exactly one output and Com(B) has exactly one input, Com(P) is composed of Com(A), followed by a single node, followed by Com(B);
  - if Com(A) has exactly one output and Com(B) has several inputs, Com(P) is composed of Com(A), Com(B) and some nodes between them: Com(A) is followed by as many nodes as the inputs of Com(B), each one linked to the single output of Com(A) and to a different input of Com(B);
  - if Com(A) has several outputs and Com(B) has exactly one input, Com(P) is composed of Com(A), Com(B) and some nodes between them: Com(A) is followed by as many nodes as the outputs of Com(A), each one linked to a different output of Com(A) and to the single input of Com(B);
  - if Com(A) has several outputs and Com(B) has several inputs, Com(P) is composed of Com(A), Com(B) and some nodes between them: Com(A) is followed by as many nodes as the number of outputs of Com(A) multiplied by the number of inputs of Com(B), connected to the outputs of Com(A) and the inputs of Com(B).

The resulting communication graph is SG, as all the operations involved in the construction are consistent with the definition of SG graphs. The communication graph created with this algorithm is clearly similar to the one created by the canonical method introduced earlier, i.e. by associating a communication node with each arc in the precedence graph of the program and propagating the dependencies from one graph to the other.
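Since the four sequence cases all insert |outputs(A)| x |inputs(B)| communication nodes, the size of Com(P) follows the same recursion as the construction itself. A sketch (the tuple encoding of SG programs is ours):

```python
def com_size(prog):
    """Return (comm_nodes, inputs, outputs) of Com(P) for an SG
    program P encoded as ('node', name), ('seq', A, B) or ('par', A, B).
    A single node yields an empty graph with one input joined to one
    output; '|' takes the union of the two graphs; a sequence inserts
    one communication node per (output of A, input of B) pair."""
    kind = prog[0]
    if kind == 'node':
        return (0, 1, 1)
    _, a, b = prog
    na, ia, oa = com_size(a)
    nb, ib, ob = com_size(b)
    if kind == 'par':
        return (na + nb, ia + ib, oa + ob)
    return (na + nb + oa * ib, ia, ob)   # 'seq'
```

For the M:N case of Fig. 4 with M = N = 2, e.g. ('seq', ('par', n, n), ('par', n, n)), the sequence step alone contributes 2 x 2 = 4 communication nodes.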
In effect, the two graphs coincide in number of nodes and in structure, except in the case of complete meshes: when an SG program contains a structure AB in which the SG subprograms A and B have, respectively, several outputs and several inputs, the algorithm interconnects them by a complete mesh of arcs, while the mesh is not complete in the result of the canonical transformation. This means that some extra dependencies among communication events have been introduced to enforce the SG constraints on the resulting graph. A more accurate analysis shows that the added dependencies are reasonable: the method forces the communications feeding the first layer of a complete mesh to preempt all the communication events from the first to the second layer (or the second-layer inputs to preempt the outputs), even when they are not directly related. In most practical cases, this kind of avoidance of contention on the communication resource is preferable.

Formally, the goal is to find (if any) a vector λ of load-assignment values λ1, ..., λ_NNODE, a vector Netλ of network load-assignment values Netλ1, ..., Netλ_NARC and a mapping vector μ = (μ1, ..., μ_NNODE), μk ∈ {1, ..., NPROC}, that binds each node to a specific processor, such that:



    λk > 0                                          ∀ k = 1, ..., NNODE
    Netλh > 0                                       ∀ h = 1, ..., NARC
    Time(λ, Ji) + NetTime(Netλ, Ji) ≤ Di            ∀ i = 1, ..., NPGMS      (1)
    Σ_{i=1..NPGMS} Load(λ, μ, Ji, Pj) ≤ 1           ∀ j = 1, ..., NPROC
    Σ_{i=1..NPGMS} NetLoad(Netλ, μ, Ji, Pj) ≤ 1     ∀ j = 1, ..., NPROC

Each element λk of the λ vector represents the amount of processor power reserved for the corresponding node k. Using the relation rk = ck/λk to compute the response time of each node, it is possible to transform (1) so as to obtain a response-time vector r instead of a load-assignment vector λ (the two representations are perfectly equivalent). In a similar way, Netλh represents the amount of network-processor power reserved for the corresponding arc h. The relation


[Figure: construction of the communication graph for A|B and for the sequence AB in the 1:1, 1:N, M:1 and M:N cases.]

Figure 4. Procedure to build the communication graph given the SG precedence graph.

Netr_h = Netc_h / Netλ_h computes the response time of the communication event represented by the arc h. The Time function computes the maximum response time of program Ji, as introduced above and shown in Fig. 1: given a vector λ of node-load assignments, it adds the response times (rk) of all sequential nodes and takes the maximum value between concurrent ones. The Load function computes the maximum load on processor Pj of the nodes of program Ji mapped on it: it adds the concurrent components and takes the maximum value over all sequential ones. The NetTime and NetLoad functions operate in the same way as their counterparts Time and Load, but they use the SG communication graph instead of the program graph and the entities related to network load (ASET, Netλ, Netc) instead of (NSET, λ, c).
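The Time and Load recursions of Fig. 1 translate directly into code. The sketch below uses a hypothetical tuple encoding of SG programs and, on top of the two recursions, checks the deadline and processor-load constraints of (1) for a single program (the same functions, run on the SG communication graph, would give NetTime and NetLoad):

```python
def sg_time(prog, c, lam):
    """Time of Fig. 1: c_k / lambda_k on a node, sum over a sequence,
    max over a concurrent composition."""
    kind = prog[0]
    if kind == 'node':
        return c[prog[1]] / lam[prog[1]]
    _, a, b = prog
    ta, tb = sg_time(a, c, lam), sg_time(b, c, lam)
    return ta + tb if kind == 'seq' else max(ta, tb)

def sg_load(prog, lam, mu, p):
    """Load of Fig. 1 on processor p: lambda_k if node k is mapped on p,
    sum over concurrent parts, max over sequential parts."""
    kind = prog[0]
    if kind == 'node':
        return lam[prog[1]] if mu[prog[1]] == p else 0.0
    _, a, b = prog
    la, lb = sg_load(a, lam, mu, p), sg_load(b, lam, mu, p)
    return la + lb if kind == 'par' else max(la, lb)

def schedulable(prog, c, lam, mu, period, nproc):
    """Constraints of (1) restricted to one program: positive loads,
    deadline met, no processor loaded above 1."""
    if any(l <= 0 for l in lam.values()):
        return False
    if sg_time(prog, c, lam) > period:
        return False
    return all(sg_load(prog, lam, mu, p) <= 1.0 for p in range(1, nproc + 1))
```

For example, x followed by y|z with costs 1, 2, 1 and loads 0.5, 0.5, 0.25 yields Time = 2 + max(4, 4) = 6, so a period of 6 is feasible while a period of 5 is not.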

5 Optimization Procedure

The optimization procedure has to produce a feasible solution to problem (1). This is done by defining a function, often called an energy function, that reaches its minimum when all the problem constraints are satisfied (this minimum is not necessarily unique). Searching for a minimum of the energy function therefore means finding a feasible solution to the scheduling problem (1). The method we use was introduced by Ingber [9] [10] as a modification of the Simulated Annealing technique. In order to speed up the annealing process, allowing fast convergence to a feasible solution, we adopted a hybrid algorithm, introducing a local search heuristic [5] [22] into the ASA method.

5.1 Energy Function Definition.

Our problem (1) involves two kinds of unknowns:
- the λ and Netλ vectors, containing the processor load assigned to computation nodes and network communications; in this section we consider the vector Λ as the concatenation of the λ and Netλ vectors (Λ ≡ λ · Netλ);
- the μ mapping vector between nodes and processors.

The value of each unknown Λ_i spans the range ]0, 1], where 0 means no work and 1 means maximum load (every cycle of the corresponding processor is used by this node). The value of each μ_i spans the range [1, NPROC]. As we said before, each computation processor has a network processor linked to it in the same computation unit, so each arc starting from a node mapped on computation processor R is mapped on the corresponding network processor R. In this way we derive the communication mapping from the node mapping. Given the above definitions, the energy function can be written as:

    E(Λ, μ) = E(λ · Netλ, μ) =
        α Σ_{i=1}^{NPGMS} σ( (Time(λ, J_i) + NetTime(Netλ, J_i)) / D_i − 1 )
      + β Σ_{j=1}^{NPROC} σ( Σ_{i=1}^{NPGMS} Load(λ, μ, J_i, P_j) − 1 )            (2)
      + γ Σ_{j=1}^{NPROC} σ( Σ_{i=1}^{NPGMS} NetLoad(Netλ, μ, J_i, P_j) − 1 )

where

    σ(z) = 0         for z ≤ 0
    σ(z) = e^z − 1   for z > 0

and α, β and γ are constants used to balance the strength of the constraints.

The function E(Λ, μ) defined in (2) has its minimum in E(Λ, μ) = 0 when all the problem constraints (1) are satisfied. (Note that the first two constraints in (1) are implicitly met by our definition of the unknowns, which span ]0, 1], while the binding of each node to a single processor is fulfilled by the definition of the object μ as a functional relationship.)
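The way σ turns the constraints of (1) into penalty terms can be sketched with a toy instance. The numbers and the parameter names below are illustrative assumptions of ours, not taken from the paper:

```python
import math

# Sketch of the penalty sigma and of the energy of (2): sigma(z) = 0 for
# z <= 0 and exp(z) - 1 for z > 0, so a constraint contributes nothing when
# satisfied and an exponentially growing penalty when violated.

def sigma(z):
    return 0.0 if z <= 0 else math.exp(z) - 1.0

# Hypothetical small instance: per-program (Time + NetTime) values with their
# deadlines, and per-processor total computation and network loads.
def energy(resp, deadlines, proc_loads, net_loads,
           alpha=1.0, beta=1.0, gamma=1.0):
    e  = alpha * sum(sigma(r / d - 1.0) for r, d in zip(resp, deadlines))
    e += beta  * sum(sigma(l - 1.0) for l in proc_loads)
    e += gamma * sum(sigma(l - 1.0) for l in net_loads)
    return e

print(energy([9.0, 19.0], [11.0, 20.0], [0.9, 0.8], [0.7]))       # → 0.0
print(energy([12.0, 19.0], [11.0, 20.0], [0.9, 0.8], [0.7]) > 0)  # → True
```

When every response time meets its deadline and every load stays below 1, all σ terms vanish and the energy is exactly 0, which is the feasibility test the annealing loop uses as a stopping condition.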



5.2 Annealing Process.

As stated before, our problem involves two kinds of unknowns. This separation derives from the intrinsically different problems that we have to solve. Finding a mapping between nodes and



Figure 5. (a) An example of a simulated annealing step using a local search technique: the accept/reject decision is taken on state Z, after the application of the local optimizer. (b) The local search method maps each value in a basin of attraction into the value of the corresponding minimum, so the energy function is transformed into a step function.

processors is reducible to the N-Knapsack NP-complete problem, requiring complex heuristic methods to be successfully solved or approximated. We address this problem using a simulated annealing technique applied only to the μ set of unknowns. In [2] and [3] we proved that, once the mapping assignment between nodes and processors is fixed, the space generated by the Λ variables and the energy function is convex, allowing a number of local search techniques to be used successfully. So once a mapping has been fixed during the annealing process, the best load assignments (Λ) can be found using, for example, a hill-(de)climbing method.

Adaptive Simulated Annealing (ASA) [9] [10] [17] is a global optimization algorithm. The result of the algorithm is the discovery of the best global fit of a non-linear, non-convex cost function over a D-dimensional state space. New states are generated using the temperature and a random variable drawn from a uniform distribution as parameters for a perturbation function, and are constrained into the search state space. Deriving new parameters according to this method leads to an exponential temperature schedule, T_k = T_0 exp(−c k^{1/D}), where T_0 is the starting temperature and c is a user-defined parameter. Newly generated states are subjected to an acceptance probability function h(ΔE, T_k) = 1/(1 + exp(ΔE/T_k)). At each step a variable is randomly chosen, a new value for it is computed according to the above rules, and the resulting state is accepted or rejected using the probability function h. Figure 6 shows the algorithm in detail. This annealing schedule ensures that a global minimum of the energy function can be obtained statistically (proven in [9]). If the dimension of the sampled space becomes too high, the convergence process may be prohibitively slow, and one cannot commit the resources needed to sample the search space ergodically. One widely used solution to this problem combines the annealing method with a local search algorithm.
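The three ASA ingredients just described — the exponential temperature schedule, the temperature-dependent generating function and the acceptance probability — can be sketched as follows. The parameter names (t0, c, dim) and default values are our placeholders for the user-defined constants:

```python
import math
import random

# Sketch of the ASA ingredients (our parameter names, not the paper's):
# schedule T_k = T0 * exp(-c * k**(1/D)), generating step
# y = sgn(u - 1/2) * T * ((1 + 1/T)**|2u - 1| - 1), acceptance
# h(dE, T) = 1 / (1 + exp(dE / T)).

def temperature(k, t0=1.0, c=1.0, dim=4):
    return t0 * math.exp(-c * k ** (1.0 / dim))

def perturb(t, rng=random):
    # |y| <= 1 for every T; as T -> 0 the step concentrates near 0, so
    # moves become more and more local as the search cools down.
    u = rng.random()
    return math.copysign(1.0, u - 0.5) * t * (
        (1.0 + 1.0 / t) ** abs(2.0 * u - 1.0) - 1.0)

def accept(delta_e, t, rng=random):
    # Downhill moves (delta_e < 0) are accepted with probability > 1/2;
    # uphill moves survive with a probability that shrinks as T falls.
    return rng.random() < 1.0 / (1.0 + math.exp(delta_e / t))

random.seed(42)
print(temperature(0))                                       # → 1.0
print(all(abs(perturb(0.1)) <= 1.0 for _ in range(1000)))   # → True
```

Note that accept() as written can overflow for very large positive ΔE at tiny temperatures; a production version would clamp the exponent.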
Let us show the method using the example in Fig. 5a. Suppose that the starting state X is locally optimal. We apply the simulated annealing method to "kick" the state out of a local minimum that does not satisfy our requirements, obtaining the state Y. The standard annealing methods apply the accept/reject procedure directly to the state Y, but it is much better to apply a local minimum search, reaching the state Z, before taking the decision. In this way we apply the accept/reject procedure only to locally optimal states, avoiding a lot of computation on non-optimal states. Applying this method is equivalent to transforming the energy function into a step function that maps each value in a particular basin of attraction of E to the value of the corresponding minimum (see Fig. 5b).


Doing so, one can profitably use the best features of both methods: the ability of the annealing method to avoid being trapped in local minima, combined with the fast local optimum finding of local search methods, allows fast convergence to the global optimum of the energy function.

    Annealing_Process()
    begin
        Init Λ and μ with random values;
        k := 1;
        repeat
            T_k := T_0 exp(−c k^{1/D});
            Λ' := Λ; μ' := μ;
            i := Random(1, 2·NNODE);
            u := Random(0, 1);
            y := Sgn(u − 1/2) · T_k · [(1 + 1/T_k)^|2u−1| − 1];
            μ'_i := μ'_i + y;
            Constrain μ'_i into [1, NPROC];
            Λ' := Local_Optimize(Λ', μ');
            if (Accept(h(E(Λ', μ') − E(Λ, μ), T_k))) then
                Λ := Λ'; μ := μ';
            k := k + 1;
        until (E(Λ, μ) = 0) or (T_k < MinTemp);
        if (T_k < MinTemp) then return NotFound
        else return Λ, μ;
    end;

Fig. 6a. Simulated Annealing algorithm pseudo-code. (The function Random(a, b) extracts uniform random numbers in the interval [a, b]; Sgn is the sign function.)

    Local_Optimize(x)
    begin
        Initialize(v);
        while (|v| ≥ THRESHOLD) do
            iter := 0;
            while ((E(x + v) > E(x)) and (iter < MAXITER)) do
                v := Random_Vector(|v|);
                iter := iter + 1;
            if (E(x + v) > E(x)) then
                v := v/2;
            else
                x := x + v; v := 2·v;
        return(x);
    end

Fig. 6b. Local search algorithm. It finds the local minimum starting from state x. THRESHOLD and MAXITER are two constants used to tune the algorithm.
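A direct transcription of the local search of Fig. 6b into executable form might look like the sketch below. The helper random_vector, the initial step size, and the THRESHOLD/MAXITER values are our assumptions:

```python
import math
import random

# Sketch of Fig. 6b (our helper names and constants): keep trying random
# steps of the current size |v|; after MAXITER failures halve the step,
# after a success move and double it, stopping when |v| < THRESHOLD.

THRESHOLD = 1e-3
MAXITER = 50

def random_vector(size, dim, rng=random):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [size * x / norm for x in v]

def local_optimize(energy, x, rng=random):
    size = 0.25                               # initial |v|: our choice
    v = random_vector(size, len(x), rng)
    while size >= THRESHOLD:
        it = 0
        while energy([a + b for a, b in zip(x, v)]) > energy(x) and it < MAXITER:
            v = random_vector(size, len(x), rng)
            it += 1
        if energy([a + b for a, b in zip(x, v)]) > energy(x):
            v = [b / 2.0 for b in v]          # no improving step found: shrink
            size /= 2.0
        else:
            x = [a + b for a, b in zip(x, v)]  # improvement: move on
            v = [2.0 * b for b in v]           # and widen the step
            size *= 2.0
    return x

# On a convex bowl (minimum at (1, 2)) the descent ends very close to it.
random.seed(7)
bowl = lambda p: (p[0] - 1.0) ** 2 + (p[1] - 2.0) ** 2
x = local_optimize(bowl, [5.0, 5.0])
print(bowl(x) < 0.01)
```

The convexity result cited above ([2], [3]) is what makes such a simple descent adequate for the Λ sub-problem once μ is fixed.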

If the scheduling problem cannot be solved, the annealing algorithm stops and returns the constant NotFound when the temperature falls below a pre-assigned minimum (MinTemp).

6 Simulation Results

To show the effectiveness of our approach, we carried out several simulations with different degrees of complexity. Some DF programs, such as QSort, FFT and Wavelet Transform, were used in various combinations and with different deadlines. Figure 7 outlines a sample problem together with the response-time vector r and the mapping function. The deadlines used are quite close to the theoretical limits imposed by the node costs; these limits were computed considering the program structures, the single node costs and the resources available in the system. In our tests the limit deadlines were computed by hand, with a priori knowledge of the best mapping function. The first table shows the mapping information and the response time of each computation node. The second table outlines the results concerning the communication arcs, listing only the arcs carrying real communication; the other arcs link nodes mapped on the same processor, so no real communication flows through them (Netr_h = 0).

7 Conclusions

The schedulability problem for a set of Coarse Grain Data Flow SG programs, communicating over deterministic networks, can be expressed either as a non-linear set of equations or as an equivalent non-linear, non-convex function to be minimized. The independent variables, or the domain variables in the functional specification, are the load (or timing characteristics) of each node and communication arc and the mapping information of each node. Although this specification does not change the hard nature of the problem, it allows a number of promising non-deterministic techniques to be applied. We showed that Adaptive Simulated Annealing can be used to give a feasible solution to the schedulability problem.

A number of extensions could be introduced into our model. The number of context switches, as well as their cost, should be taken into account. Programs whose period differs from their


deadline should be managed. In [4] we presented approximate results that can be used as an estimate of the number of context switches. Moreover, in [4] we outlined a method to manage all the combinations of deadlines and periods in a single program, although in that case some resources are wasted. We are currently working on a model that embodies all the previous situations, managing them directly, without loss of system resources.

Appendix

Processor load computation for a data-flow program having a DAG precedence graph is in general an NP-complete problem. The processor load of the program is the maximum among all the loads generated by sets of nodes that can be executed concurrently. It is worth pointing out that for this purpose there is no difference between the single-processor and multi-processor cases: for each processor in a multiprocessor environment, the load can be computed by reducing the graph to the subgraph of the nodes mapped on the processor under consideration, or by assigning a null load to all the nodes mapped on other processors. Both ways lead to equivalent problems.

Naming G = (V, E) the precedence graph of the program, the complement G^c of G, where G^c = (V, E^c) with E^c = {{u, v} : u, v ∈ V and (u, v) ∉ E}, is the concurrency graph: {u, v} ∈ E^c iff u and v represent nodes that can run concurrently. Note that G^c is an undirected graph. The processor load of the program is:

    L = max_{W ∈ CS} Σ_{x ∈ W} load(x)

where CS is the set of all the cliques in G^c, i.e. W ∈ CS iff W ⊆ V and ∀x, y ∈ W, {x, y} ∈ E^c. More formally, NP-completeness can be proved with the following reduction from the clique problem, which is known to be NP-complete [11].

Assertion. Given a concurrency graph G^c, a general solution of the load problem can also solve the clique problem on G^c.

Proof. The clique problem can be stated as follows: given a graph G^c = (V, E^c) and an integer K ≤ |V|, does G^c contain a clique of size K or more? Assigning load 1 to each node of the program, the global load L is in this case the number of nodes composing the largest clique in G^c (or one of the largest cliques, as there may be several cliques with the same number of nodes). The clique problem can then be solved by comparing K and L. qed.
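A brute-force rendering of this construction can help make it concrete. It is exponential, as expected for an NP-complete problem, so it is usable only on tiny graphs; note also that, as an assumption of ours, we close the precedence relation transitively before complementing, so that indirect predecessors also count as ordered:

```python
from itertools import combinations

# Sketch of the Appendix construction: build the concurrency graph G^c as the
# complement of the (transitively closed) precedence relation, then take the
# maximum total load over its cliques by brute-force enumeration.

def processor_load(n, prec_edges, loads):
    # Transitive closure of the precedence DAG (Floyd-Warshall style).
    reach = [[False] * n for _ in range(n)]
    for u, v in prec_edges:
        reach[u][v] = True
    for k in range(n):
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    # {u, v} is an edge of G^c iff neither precedes the other.
    concurrent = lambda u, v: not reach[u][v] and not reach[v][u]
    best = 0
    for size in range(1, n + 1):
        for w in combinations(range(n), size):
            if all(concurrent(u, v) for u, v in combinations(w, 2)):
                best = max(best, sum(loads[x] for x in w))
    return best

# Diamond a->b, a->c, b->d, c->d: only {b, c} can run concurrently,
# so the load is load(b) + load(c).
print(processor_load(4, [(0, 1), (0, 2), (1, 3), (2, 3)], [3, 4, 2, 1]))  # → 6
```

Assigning load 1 to every node turns this same function into the clique-size oracle used in the reduction above.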

References

[1] Davoli, R., Giachini, L.A., Babaoglu, O., Amoroso, A., Alvisi, L.: Parallel Computing in Networks of Workstations with Paralex. IEEE Transactions on Parallel and Distributed Systems 7(4) (1996) 371–384
[2] Davoli, R., Giachini, L.A.: Schedulability Checking of Data-Flow Tasks in Hard Real-Time Distributed Systems. UBLCS Technical Report Series 94-4 (1994)
[3] Davoli, R., Giachini, L.A.: A Schedulability Algorithm for Data Flow Hard-Real-Time Distributed Programs. Proc. IFAC Real Time Programming, Lake Constance (1994) 39–43
[4] Davoli, R., Tamburini, F., Giachini, L.A.: Scheduling Data Flow Programs in Hard-Real-Time Distributed Environments. Proc. Formal Techniques in Real Time and Fault Tolerant Systems, Uppsala, LNCS 1135, Springer Verlag (1996) 263–278
[5] Desai, R., Patil, R.: SALO: Combining Simulated Annealing and Local Optimization for Efficient Global Optimization. Proc. of the 9th Florida AI Research Symposium, Key West (1996) 233–237
[6] Ferrari, D., Verma, D.C.: A Scheme for Real-Time Channel Establishment in Wide Area Networks. IEEE Journal on Selected Areas in Communications 8(3) (1990) 368–379
[7] Gerber, R., Hong, S., Saksena, M.: Guaranteeing Real-Time Requirements with Resource-Based Calibration of Periodic Processes. IEEE Trans. on Software Engineering 21(7) (1995) 579–592
[8] Goddard, S., Jeffay, K.: Distributed Real-Time Dataflow: An Execution Paradigm for Image Processing and Anti-Submarine Warfare Applications. Presented in the WIP session, RTSS, Washington D.C. (1996)
[9] Ingber, L.: Very Fast Simulated Re-Annealing. Mathl. Comput. Modelling 12(8) (1989) 967–973
[10] Ingber, L.: Adaptive Simulated Annealing. Technical Report, Lester Ingber Research, McLean, VA (1993)
[11] Karp, R.M.: Reducibility Among Combinatorial Problems. In Miller, R.E., Thatcher, J.W. (eds.): Complexity of Computer Computations, Plenum Press, New York (1972) 85–103
[12] Liu, C.L., Layland, J.W.: Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment. Journal of the ACM 20(1) (1973) 46–61
[13] McNaughton, R.: Scheduling with Deadlines and Loss Functions. Management Science 6(1) (1959) 1–12
[14] Mok, A.K.: Fundamental Design Problems of Distributed Systems for the Hard Real-Time Environment. PhD Thesis, Dept. of Electrical Engineering and Computer Science, MIT, Cambridge, Mass. (1983)
[15] Muntz, R.R., Coffman, E.G. Jr: Preemptive Scheduling of Real-Time Tasks on Multiprocessor Systems. Journal of the ACM 17(2) (1970) 324–338
[16] El-Rewini, H., Lewis, T.G.: Scheduling Parallel Program Tasks onto Arbitrary Target Machines. Journal of Parallel and Distributed Computing 9 (1990) 138–153
[17] Rosen, B.: Function Optimization Based on Advanced Simulated Annealing. IEEE Workshop on Physics and Computation - PhysComp 92 (1992) 289–293
[18] Sih, G.C., Lee, E.A.: Declustering: A New Multiprocessor Scheduling Technique. IEEE Transactions on Parallel and Distributed Systems 4(6) (1993) 625–637
[19] Sih, G.C., Lee, E.A.: A Compile-Time Scheduling Heuristic for Interconnection-Constrained Heterogeneous Processor Architectures. IEEE Transactions on Parallel and Distributed Systems 4(2) (1993) 175–187
[20] Stankovic, J.A., Spuri, M., Di Natale, M., Buttazzo, G.C.: Implications of Classical Scheduling Results for Real-Time Systems. IEEE Computer 28(6) (1995) 16–25
[21] Tindell, K., Burns, A., Wellings, A.: Allocating Hard Real-Time Tasks: An NP-Hard Problem Made Easy. Journal of Real-Time Systems 4(2) (1992) 145–165
[22] Yuret, D.: From Genetic Algorithms to Efficient Optimization. MS Thesis, Dept. of Electrical Engineering and Computer Science, MIT (1994)
[23] Zhu, J., Lewis, T., Jackson, W., Wilson, R.: Scheduling in Hard Real-Time Applications. IEEE Software 12(3) (1995) 54–63


QSort:

    n_k   c_k   μ_k   r_k           a_h       Netc_h   Netr_h
    q0    1.0   1     1.0893        q0 → q1   0.10     0.2461
    q1    1.0   4     2.0317        q1 → q4   0.10     0.2265
    q2    1.0   1     2.0317        q2 → q5   0.10     0.4413
    q3    3.0   4     5.9534        q3 → q7   0.10     0.1777
    q4    3.0   1     6.0472        q5 → q7   0.10     0.2723
    q5    3.0   4     6.0952
    q6    3.0   1     6.0472
    q7    1.0   1     1.0118

    Total Time 10.9419   QSort Deadline 11.0000   Limit Deadline 9.3000

FFT:

    n_k   c_k   μ_k   r_k           a_h       Netc_h   Netr_h
    f0    1.0   0     1.0039        f0 → f1   0.16     0.3624
    f1    4.0   2     8.0629        f0 → f4   0.16     0.2864
    f2    4.0   0     8.2580        f1 → f5   0.01     0.0568
    f3    4.0   0     8.0000        f1 → f7   0.01     0.0316
    f4    4.0   2     8.1269        f2 → f6   0.01     0.0163
    f5    4.0   0     7.6992        f2 → f8   0.01     0.0568
    f6    4.0   2     8.1920        f3 → f6   0.01     0.1600
    f7    4.0   0     8.4628        f3 → f8   0.01     0.0882
    f8    4.0   2     7.9379        f4 → f5   0.01     0.2133
    f9    1.0   0     1.0158        f4 → f7   0.01     0.0656
                                    f6 → f9   0.16     0.5610
                                    f8 → f9   0.16     0.2786

    Total Time 19.8775   FFT Deadline 20.0000   Limit Deadline 18.6800

Wavelet Transform (WT):

    n_k   c_k   μ_k   r_k           a_h       Netc_h   Netr_h
    w0    1.0   5     1.4883        w0 → w3   0.16     0.3034
    w1    5.0   5     10.322        w0 → w4   0.16     0.4602
    w2    5.0   5     10.158        w2 → w5   0.01     0.0360
    w3    5.0   3     10.406        w3 → w6   0.01     0.0609
    w4    5.0   3     9.7709        w4 → w6   0.01     0.0752
    w5    5.0   3     5.3781        w5 → w7   0.02     0.0752
    w6    5.0   5     5.0793
    w7    5.0   5     5.0196

    Total Time 22.9034   WT Deadline 23.0000   Limit Deadline 21.3600

Mapping function: NPROC = 6; a node mapped X runs on processor X (the μ_k columns above).

    Computation processor loads:  CP0 = 0.9960  CP1 = 0.9921  CP2 = 0.9921  CP3 = 0.9921  CP4 = 0.9960  CP5 = 0.9960
    Network processor loads:      NP0 = 1.0000  NP1 = 0.6328  NP2 = 0.8593  NP3 = 0.2968  NP4 = 0.9296  NP5 = 0.8750

Figure 7. Test involving different programs (an instance of QSort, an FFT and a Wavelet Transform).

