ALLOCATING AND SCHEDULING HARD REAL-TIME TASKS ON A PARALLEL PROCESSING PLATFORM

A. Burns, M. Nicholson, K. Tindell, N. Zhang
Department of Computer Science, University of York, UK
email: [email protected]
ABSTRACT
This paper addresses the issues of scheduling and allocation/configuration of a point-to-point parallel system, for safety-critical hard real-time systems. Three specific topics are considered: an analysable computational model that has sufficient expressive power whilst retaining flexibility for allocation; a scheduling approach that allows the worst case response times of each of the system's transactions to be calculated; and an allocation algorithm, based on the use of simulated annealing, that allows system-level configuration to be undertaken. The DIA (Data Interaction Architecture) is used as an example parallel processing platform. A sizable case study is analysed to show the applicability of the techniques proposed.

1. INTRODUCTION

Point-to-point architectures have a number of advantages over other forms of distribution: there is no physically shared communication medium to schedule, and they re-scale (off-line) with the minimum of disturbance. In this paper we consider the problems involved in running hard real-time safety-critical applications on point-to-point networks.

An example of a point-to-point network is the DIA (Data Interaction Architecture) [22]. This model has been designed specifically to support real-time applications. In DIA, nodes are linked via dual-port memories; the network is not fully connected and hence message routing needs to be considered. Each node consists of a processor and a KEC (Kernel Executive Chip); the role of the KEC is to manage the scheduling of work on the processor. This work is composed of processes that have assigned priorities. Dispatching is cooperative: a process will continue executing until it either blocks or executes a "voluntary suspend". When such a suspend is executed, the KEC will undertake a process switch if there is a runnable process of greater or equal priority. Interrupts are handled by the KEC; the node processor is not interrupted.

The dual-port memories that link adjacent nodes implement (with the help of the two nodes) Simpson's algorithms [21]. These protocols allow read and write operations to occur concurrently without interference or blocking. In essence, the protocols ensure that a read operation can immediately receive the most recent fully written data item. The use of these algorithms (which do not need a common time frame) decouples the temporal behaviour of each node from its neighbours, removing the need for synchronisation algorithms.

In this paper we address the issues of scheduling and allocation/configuration in a point-to-point distributed system. The DIA architecture is used as an exemplar; the analysis presented is, however, applicable to other point-to-point systems. We are concerned with the production of systems that will guarantee end-to-end timing deadlines. To achieve this we restrict our considerations to systems that do not perform process migration. Allocation is thus a pre-run-time activity. We do, however, consider the use of techniques, such as replication, to give fault tolerance.

The paper is structured as follows. In the next section a general computational model is presented. Section 3 then covers the schedulability analysis. Allocation is addressed in section 4. A case study is described in section 5, and our conclusions are outlined in section 6.
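To make the wait-free behaviour concrete, the following is a minimal single-reader/single-writer sketch of Simpson's four-slot mechanism in Python. It is an illustration only: the class and variable names are ours, and the atomicity of the single-word control-variable assignments is assumed (in DIA this would be provided by the dual-port memory hardware).

```python
class FourSlot:
    """Sketch of Simpson's four-slot asynchronous communication
    mechanism: one writer, one reader, neither ever blocks, and a
    read always returns the most recent fully written item."""

    def __init__(self, initial):
        self.data = [[initial, initial], [initial, initial]]  # 2x2 slots
        self.slot = [0, 0]   # last slot written within each pair
        self.latest = 0      # pair most recently written
        self.reading = 0     # pair the reader is currently using

    def write(self, item):
        pair = 1 - self.reading        # avoid the pair being read
        index = 1 - self.slot[pair]    # avoid the slot last written
        self.data[pair][index] = item  # write into a free slot
        self.slot[pair] = index        # publish within the pair
        self.latest = pair             # publish the pair

    def read(self):
        pair = self.latest             # take the freshest pair
        self.reading = pair            # announce it to the writer
        index = self.slot[pair]
        return self.data[pair][index]
```

Because the write is non-blocking and destructive while the read is non-blocking and non-destructive, this mechanism matches the pool semantics described in the next section.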
2. COMPUTATIONAL MODEL

The model used in this paper is derived from the design method MASCOT [20] and the extensions used in DORIS [23], and recognises the notion of process and two forms of Intercommunication Data Areas (IDAs): pools and signals. The aim of this section is to introduce the basic concept of a transaction and to discuss how different synchronisation protocols can be supported within the concept of a transaction by using pools and signals.

A transaction can be defined as [12]: the single execution of a specified set of processes between a stimulus and the corresponding response. The stimulus can be the result of an external event (such as the arrival of data from a device) or an internal event (such as a clock tick or a message from a process). The processes execute in an order determined by a set of precedence constraints. Thus, the 'building blocks' of transactions are groups of the form:
[Figure: a basic transaction building block, with process P1 writing to an IDA that is read by process P2]
where P1 and P2 are processes and the IDA provides the mechanism for data to pass between them. If P1 and P2 are on different (but adjacent) nodes then the IDA will be mapped onto the dual-port memory linking the two nodes. Processes such as P2 are constrained to be of the form:

    loop
        read data from an IDA
        perform some data transformation
        write data to an IDA
    end loop

This restriction, requiring the acquisition of data at the beginning of the process, is not too constraining. Any statements that don't need the data can be executed as easily after the read as before, and processes can be split where necessary. These constraints ensure that processes can only be blocked at the beginning (or end) of their execution, if at all; this behaviour is deterministic, and therefore the worst case response time of each transaction can be calculated. A process can voluntarily offer to give up the processor by the explicit use of a SUSPEND instruction. At this rescheduling point a higher priority process can preempt the suspending process. The process can also go into a WAIT state at the beginning of its loop, awaiting a stimulus.

It is not the purpose of this paper to consider how the set of transactions is derived from the specification of requirements. Rather, it is concerned with how a given set of transactions can be allocated and scheduled so that timing requirements are met. The remainder of this section reviews MASCOT's IDAs. These will be used to illustrate how transactions can be constructed.

2.1. MASCOT's IDAs

The IDAs defined in the design method MASCOT (Modular Approach to Software Construction, Operation and Test) can take on three forms: channel, signal or pool. These allow end-to-end models of transactions to be constructed. In the following discussion we restrict our considerations to the use of only signals and pools; these are adequate for our needs, and it is arguable that the channel is an inappropriate abstraction for real-time applications: channels ensure that no data is lost, whereas in many real-time applications only the most up-to-date data is required.

Signals

This type of IDA acts as a temporary repository for items of data en route from (say) process P1 to process P2. It is characterised by a destructive read operation. If the signal has a capacity it acts like a buffer; if not, it acts like a "normal" signal. A process attempting to read from an empty signal will be blocked until data is available. Thus the act of writing to a signal embodies a stimulus that can wake a blocked process. The write action is also destructive (i.e. it overwrites a full buffer and hence is non-blocking). We can represent the stimulus as a STIM action (note the symbols used for signal and pool in the following diagrams):

[Figure: process P1 writing to a signal, with a STIM releasing process P2]
Data which 'travels' through a signal is not changed in any way. The process P1 writes data into the signal and STIMs process P2 to indicate that the data is available. P2 is now scheduled and can retrieve the data when it is next activated. There is a precedence relationship such that P2 must wait for P1 to write the data before it can continue execution. Thus, signals provide loosely synchronous (or weakly synchronous) intercommunication. If P2 is scheduled to read the IDA before P1 has overwritten the data then no data is lost. However, in many real-time applications processes such as P2 are required to use the most up-to-date data available. Issues of data loss are therefore not applicable (which is why the MASCOT channel, which ensures that data is not lost, is not used).

Pools

A pool IDA consists of a collection of variables which are given initial values when the system commences execution and which may subsequently be examined and updated [20]. A read operation is thus non-destructive and non-blocking. The pool is also characterised by a destructive, non-blocking write operation.
[Figure: process P1 writing to a pool that is read by process P2]
Since the reading process can extract data from the pool at any time, this IDA is asynchronous.

2.2. Periodic and Sporadic Activities

It is common in the real-time domain for activities to be considered either periodic or sporadic. Periodic processes are released by the tick of some system clock. Sporadic processes are released by some event. They are sporadic rather than aperiodic because there is a minimum time between two activations of the process (this allows worst case performance to be defined) [2]. At run-time there is no real distinction between a periodic and a sporadic process: each is released by an event (although for periodic processes it is a tick event). In the model presented here a clock tick is modelled as a contentless signal. All processes are thus released by the STIM operation of a signal. As the write operation on a signal or pool is non-blocking, the only action that can prevent a process from continuing is a read operation on a signal. A periodic process thus "reads" the clock signal at the start of its period; it will be released when its next period is due. Processes are constrained to have a single signal read operation. They may have any number of signal write operations, or pool read and write operations.

2.3. Transaction Models

The IDAs discussed above allow two end-to-end timing abstractions for transactions to be defined: the asynchronous model (using pools) and the loosely synchronous model (using signals). Combinations are also possible:
[Figure: a combined transaction. A clock STIM releases P1; P1's output signals (STIM1, STIM2) release P2 and P3; P3 writes to a pool read by P4, which is released by a later clock STIM. Dashed lines show data exchange with the environment.]
The dashed lines represent data exchange with the environment. In this example P1 is released by a clock action (it is a periodic activity). Data is then read from some local sensor and passed on to P2. Processes P2 and P3 are released by the arrival of the data with the signal (they are sporadic activities); they both transform the data, in some application-defined way, before passing it on. A pool is used for communication between P3 and P4. Hence P4 is also a periodic process: it will be released by a later clock STIM. The interval between the two clock STIMs (for P1 and P4) should be chosen so that P1, P2 and P3 will have completed their executions before P4 executes. P4 will thus have a very regular behaviour, with little release jitter on the data it outputs to its local actuator.

2.4. Safety Design Idioms

In safety-critical hard real-time systems, safety as well as timing issues need to be addressed. In this section we consider a number of safety design idioms (SDIs)† that can be applied to elements of a DIA system to overcome, or mask, identified failure modes. The use of SDIs will have an impact on the resource usage and timing attributes of a system. These must be taken into account in allocating and scheduling elements of a design to the hardware infrastructure. They will also have an impact on the safety characteristics of the proposed system, which can be indicated by measures of system reliability.

Two assumptions are made about the failures that a system will encounter. First, software failure rates are given by the underlying hardware failure rates. Second, failures are independent. Research is being undertaken to remove these assumptions through the exploratory failure modes analysis technique SHARD [17] (Software Hazard Analysis and Resolution in Design) and FPTN (Failure Propagation and Transformation Notation) [8].

A number of SDIs could be applied to DIA systems. However, the computational model of DIA indicates that forward error correcting approaches should be used. In forward error correction it is the receiver/reader alone that detects and corrects an error. Three idioms that appear particularly useful are watchdogs, safety kernels [5] and replication [15]. Watchdogs can be applied to tasks or full transactions. They increase the worst case execution times of tasks or transactions by the duration of their start-up and of any remedial actions that are undertaken if a time-out occurs. In DIA the timing can be undertaken by the kernel chip, and a task interrupted if it overruns its worst case slice time. Safety kernels should have little impact on the timing and resource usage of a system, as their actions can be undertaken on the kernel chips, which run in parallel with the main processing elements. Replication, however, requires duplicate, and extra, tasks to be executed. Thus replication is worth considering in some detail.

Tasks and transactions can be replicated so that the failure of any single processing element or link can be tolerated. An exact copy of a task is placed on a separate processor. Different protocols can be produced to overcome a range of potential failures. Elements can be assumed to fail silently or to always produce results (which may be erroneous). Faults can be transient or permanent. In Nicholson [15] three protocols are considered: AND, OR and two-from-three. Consider, by way of an example, the replication of a transaction where messages may fail silently. The configuration of such a replicated transaction is shown below.
The transaction consists of tasks A, B and C; tasks D and E are replicas. Tasks F and G are extra tasks that determine which result is to be passed on to the transaction output task H, and are placed on the same processor.

[Figure: a replicated transaction mapped to processors. Task A feeds both the original chain B, C (selected by F) and the replica chain D, E (selected by G); the selected result is passed to the output task H.]

† Safety design idioms can be defined as any algorithm, technique or structure which defends against some class(es) of potentially safety-significant event.
The replica tasks D and E must not be placed on the same processors as the corresponding original tasks (B and C). Furthermore, none of the links used to connect tasks A, B, C, F may be used to connect tasks A, D, E, G.

3. SCHEDULABILITY ANALYSIS

As indicated in the introduction, we assume a static allocation of processes to nodes. The next section describes how this is undertaken. In this section we introduce the analysis that will enable an allocation to be assessed. Having allocated each process to a node, and assigned a priority to each process, the schedulability analysis will indicate the worst case response times for (sub-)transactions on that node. Full system-wide response times can thus be checked by adding together the response times of each sub-transaction.

We assume that each process on each node is characterised by its worst case execution time, C, and its minimum interarrival time, T. For periodic processes the release rate is fixed and hence the value of T is easily obtained. For a sporadic process the value of T is taken, in this section, to be the minimum interval between any two releases of the process. In the next subsection this (pessimistic) assumption will be reassessed.

The basis of the analysis is the assumption that the execution environment provides priority based preemptive scheduling. The preemption can be deferred (as in the DIA architecture), but there must be an upper bound on the time before a context switch is performed (if a higher priority process now wishes to execute). This upper bound is the maximum blocking time any process can experience, and is denoted by B in the following analysis. The schedulability test calculates the longest response time, R, that each process can experience. This is expressed as follows:

$$R = C + I + B$$

where C is the worst case computation time; I is the total computation time (interference) that higher priority processes can generate in any interval (t, t+R] (for arbitrary time t); and B is the blocking time that the process suffers from lower priority processes (with a deferred preemption model this is the maximum deferred time). Note that, as with the priority ceiling protocol [19], the lowest priority process does not suffer a block (although it does suffer the maximum interference). An expression for I comes from consideration of all higher priority processes. The general equation for process Pi is as follows (note that the larger the value of i, the lower the priority; also that P1, the highest priority process on this node, is analysed first):
$$R_i = C_i + \sum_{j=1}^{i-1} \left\lceil \frac{R_i}{T_j} \right\rceil C_j + B \qquad (1)$$

The ceiling quantity $\lceil R_i / T_j \rceil$ indicates how often process $P_j$ will execute in the interval of interest $(0, R_i]$.
By multiplying this value by $C_j$, the total computational interference that $P_i$ suffers from process $P_j$ is obtained. Equation (1), without the blocking factor, was originally derived by Joseph and Pandya [13]; it is considered in detail by Audsley et al [3].
The response time of the highest priority process $P_1$, which suffers no interference, is given by $R_1 = C_1 + B$. For other processes, equation (1) is recursive in $R_i$; it can be solved by giving $R_i$ an initial estimate of $R_{i-1} + C_i - B$ and then iterating on the calculated values of the right hand side of the equation:
$$R_i^n = C_i + \sum_{j=1}^{i-1} \left\lceil \frac{R_i^{n-1}}{T_j} \right\rceil C_j + B \qquad (2)$$
In general there may be more than one solution to equation (1). The smallest value of $R_i$ is obtained if the initial estimate for the iteration is less than this required value; the above initial estimates ensure this. Equation (1) always has a solution if the utilisation of the node is not greater than 100%, i.e.

$$\sum_{k=1}^{n} \frac{C_k}{T_k} \le 1$$
where n is the number of processes on that node. If one considers the execution of the process set up to the LCM of the process periods, then all processes will receive their execution requirements unless the total demand in that interval exceeds the LCM value (i.e. unless utilisation exceeds 100%). However, the normal requirement is for each process to have a worst case response time no greater than its period. If the response time of a process is greater than its period then it is possible for the process to interfere with itself; equation (1) is no longer valid in this situation.

To illustrate the use of equation (1), consider the simple three-process set given in Table 1. Let the blocking time be 4 units of computation.
Task     Period   Computation Time   Priority
Task_1   16       4                  1
Task_2   24       6                  2
Task_3   40       14                 3

Table 1: Example Process Set
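As a quick check, the utilisation of this process set is comfortably within the bound given above:

$$\sum_{k=1}^{3} \frac{C_k}{T_k} = \frac{4}{16} + \frac{6}{24} + \frac{14}{40} = 0.25 + 0.25 + 0.35 = 0.85 \le 1$$

so equation (1) is guaranteed to have a solution on this node.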
Task_1 has the highest priority and has a response time of 8. Task_2 has an earliest possible response time of 14; putting this value into equation (1) gives a right hand side value of 14, hence 14 is indeed the worst case response time. For Task_3 the initial estimate is 24 (14+14-4); the right hand side of equation (1) is thus (note that the blocking factor is zero, as this is the lowest priority process):

$$14 + \left\lceil \frac{24}{16} \right\rceil 4 + \left\lceil \frac{24}{24} \right\rceil 6$$

This yields a value of 28. Hence:

$$14 + \left\lceil \frac{28}{16} \right\rceil 4 + \left\lceil \frac{28}{24} \right\rceil 6$$
which equates to 34. Another iteration gives a value of 38; this value is stable (i.e. it causes equation (1) to balance) and hence the actual worst case response time of Task_3 is 38.

Equation (1) is sufficient but not necessary: the response times derived by this equation are always greater than (or equal to) those experienced at run time. If the blocking factor B is zero the equation becomes sufficient and necessary. For non-zero B it is possible to construct process sets that will behave slightly better than predicted by equation (1).
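The fixed-point iteration of equation (2) is straightforward to mechanise. The following Python sketch is our own illustration, not part of the paper's tooling; it starts the iteration at $C_i$ (any estimate no greater than the smallest solution converges to the same fixed point, since the right hand side is monotone) and reproduces the figures of the worked example above:

```python
from math import ceil

def response_time(i, C, T, B):
    """Worst-case response time of process i via equation (2).
    Processes are indexed in priority order (0 = highest priority);
    B is the blocking factor suffered by process i."""
    R = C[i]  # initial estimate below the smallest solution
    while True:
        # interference from all higher priority processes j < i
        R_next = C[i] + B + sum(ceil(R / T[j]) * C[j] for j in range(i))
        if R_next == R:        # fixed point reached: the equation balances
            return R
        R = R_next

# Table 1: computation times and periods; blocking is 4 units,
# except for the lowest priority process, which suffers no block.
C, T = [4, 6, 14], [16, 24, 40]
for i in range(3):
    B = 0 if i == len(C) - 1 else 4
    print(f"Task_{i+1}: R = {response_time(i, C, T, B)}")
# -> Task_1: R = 8, Task_2: R = 14, Task_3: R = 38
```

For an unschedulable process set this iteration diverges, so a practical implementation would abandon the loop once R exceeds the process deadline.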
3.1. Improved Schedulability Analysis

3.1.1. Release Jitter

To undertake the required scheduling analysis for a mixture of periodic and sporadic activities requires there to be a maximum load exerted on each node by each loosely synchronous transaction. Unfortunately, sporadic processes suffer from release jitter. In general, loosely synchronous transactions have the property that a second invocation (release by a clock STIM) of a transaction can "catch up" with the previous one. For example, a process P with a period of 10 and a worst case response time of 9 could generate two STIMs 1 unit of time apart. To test for schedulability the maximum computational interference must be calculated. For processes suffering interference from P there are three choices:

(a) Assume an arrival interval of 10: this will underestimate the interference from P and cause the analysis to be not sufficient.
(b) Assume an arrival interval of 9: this will overestimate the interference from P and cause the analysis to be pessimistic.
(c) Assume the first interval is 9 and subsequent ones are 10: the analysis is sufficient and not unduly pessimistic.

Clearly the third approach is preferred. Equation (1) can thus be modified to reflect this [2]:
$$R_i = C_i + \sum_{j=1}^{i-1} \left\lceil \frac{R_i + J_j}{T_j} \right\rceil C_j + B \qquad (3)$$
Here $J_j$ is the maximum release jitter experienced by process j (9 in the above discussion).

3.1.2. Deferred Preemption

Equation (1) has a blocking factor that accounts for the maximum time a lower priority process can be executing in the interval $(0, R_i]$. The interference factor similarly accommodates all possible releases of higher priority processes in this interval. With deferred preemption, however, this is pessimistic, as the process cannot be preempted in the last phase of its execution. Let $F_i$ represent this last non-preemptable phase (i.e. the computation time between offering the last preemption (suspend) and the completion of the process's execution). Interference can now only occur in the interval $(0, R_i - F_i]$. Within this interval the process in question needs to execute for $C_i - F_i$. Let $R_i'$ be the worst case response time for the process to execute $C_i - F_i$. By equation (1):

$$R_i' = C_i - F_i + \sum_{j=1}^{i-1} \left\lceil \frac{R_i'}{T_j} \right\rceil C_j + B \qquad (4)$$

with

$$R_i = R_i' + F_i \qquad (5)$$
Note that equation (3) could also have been used if release jitter is present. Strictly, $F_i$ must be sufficiently short to ensure that the last phase has actually started by $R_i'$; hence $F_i < B$ (where B is the maximum length of deferred preemption).

Consider the three-process set defined by Table 1, for which the application of equation (1) gave a response time for Task_3 of 38. Let $F_3$ have a value of 3. Equation (4) thus becomes:

$$R_3' = 11 + \left\lceil \frac{R_3'}{16} \right\rceil 4 + \left\lceil \frac{R_3'}{24} \right\rceil 6$$
An initial estimate of $R_3'$ equal to 24 produces a new value of 25; this in turn produces a value of 31, which balances the equation. Hence, by equation (5), $R_3$ is equal to 34. This represents a significant reduction. Table 2 summarises the response time predictions of equation (1) and equation (5).
Task     T    C    F    R (eq. 1)   R (eq. 5)
Task_1   16   4    3    8           8
Task_2   24   6    3    14          14
Task_3   40   14   3    38          34

Table 2: Predicted Response Times
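Extending the earlier sketch, the following Python fragment (again our own illustration) combines the jitter term of equation (3) with the deferred preemption analysis of equations (4) and (5); with zero jitter it reproduces the R (eq. 5) column of Table 2:

```python
from math import ceil

def response_time_dp(i, C, T, F, B, J):
    """Equations (3)-(5): iterate for R', the worst-case response
    time of the preemptable part C_i - F_i, then add the final
    non-preemptable phase F_i (equation (5)). J[j] is the release
    jitter of each higher priority process (equation (3))."""
    R = C[i] - F[i]
    while True:
        R_next = C[i] - F[i] + B + sum(
            ceil((R + J[j]) / T[j]) * C[j] for j in range(i))
        if R_next == R:
            return R + F[i]   # equation (5)
        R = R_next

C, T, F = [4, 6, 14], [16, 24, 40], [3, 3, 3]
J = [0, 0, 0]                 # no release jitter in this example
for i in range(3):
    B = 0 if i == len(C) - 1 else 4
    print(f"Task_{i+1}: R = {response_time_dp(i, C, T, F, B, J)}")
# -> Task_1: R = 8, Task_2: R = 14, Task_3: R = 34
```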
3.2. Summary

The above analysis is applicable to all statically allocated systems (including single-processor systems). It is used specifically in the following section for point-to-point architectures. The DIA platform, which is being used as our exemplar point-to-point architecture, employs deferred preemption to ease implementation and to provide mutual exclusion over internal data. This results in a single blocking factor, B, being used in equation (1). In a more conventional run-time kernel, ceiling priority levels may be used to give mutual exclusion and bounded blocking [19]. Rather than a constant factor B in equation (1), a process-dependent factor, $B_i$, must then be used. This value is obtained by analysing the critical sections of each process; it does not, however, make a significant impact on the form of equation (1) or on the extensions that can be derived from it.

A final point of clarification is perhaps useful. Although offsets can be used to implement asynchronous transactions, they play no part in the analysis of timing properties. Equations (1) and (2) assume that there exists a critical instant at which all processes are released together. This instant represents the maximum load on the processor and hence is when the worst case response times are obtained. If a critical instant does not occur, the equations become sufficient but not necessary (i.e. pessimistic). Offsets lead to process sets that have no critical instant when the processes offset to one another are on the same processor. With asynchronous transactions the offsets are between processes that can be on different processors, and hence it is acceptable to assume the safe property that a critical instant does occur.

4. ALLOCATING TRANSACTIONS

In a general distributed system a transaction may only be loosely bound to the available nodes a priori, although at run time it is fixed. Some processes may be constrained to run on particular nodes (for example, input and output processes); the rest are able to execute anywhere. Other restrictions on allocation are also possible: if transaction replication is used for fault recognition or increased availability then there will be a need to keep replicas apart. The global scheduling problem involves allocating processes so that all constraints are satisfied and all deadlines are met. Such an allocation is called feasible. If more than one feasible allocation exists then an optimal one has some further parameter maximised. However, it must be remembered that in general the allocation problem is NP-hard [4].

Where a shared communication medium is used, the allocation problem has a fixed set of transactions and processes to deal with. Recent results have shown that the use of simulated annealing is appropriate for this allocation problem [6]. When a point-to-point architecture is involved the situation is not as straightforward. Some transactions are only feasible if they are extended so that they can be mapped onto the available hardware. For example, an input process linked directly to an output process cannot be supported if the two processes are bound to nodes that do not have a direct link. In such cases extra processes and IDAs are required for routing. The following sections indicate how the asynchronous and loosely synchronous models can be extended.
4.1. Asynchronous Routing

An asynchronous transaction uses pools:

[Figure: process A writing to a pool that is read by process B]
Process B is periodic, with an appropriate offset to allow it to read the data in the pool before it is overwritten. To further distribute this transaction requires extra processes (on intermediate nodes) that will route the data. By using a signal with a STIM the original semantics are preserved:

[Figure: process A writes the data and STIMs an agent process AGT on an intermediate node; AGT forwards the data to the pool read by process B]
The deadline of the agent process (AGT) is chosen to ensure that the overall end-to-end timing requirement is still met.

4.2. Loosely Synchronous Routing

A loosely synchronous transaction has a signal between each process. Consider the following simple two-process sporadic transaction:

[Figure: process A writing to a signal whose STIM releases process B]
If this must be implemented via an intermediate node then a new process must be created which, when STIMed, will read data from one signal and immediately write to another (and then STIM the reader):

[Figure: process A writes to a signal whose STIM (STIM1) releases the agent process AGT; AGT writes to a second signal whose STIM (STIM2) releases process B]
The semantics of the original signal transaction are preserved, and the end-to-end deadline is met if the new process set is given appropriate deadlines.

4.3. The Allocation Procedure

Allocation involves mapping a fixed set of transactions to a fixed point-to-point architecture of nodes†. In general, not all nodes are joined directly to each other. End-to-end deadlines must be met, but the deadlines of individual processes can be fixed as part of the allocation procedure. For periodic transactions, offsets will also be set (although they are derived from the response times of earlier processes). To facilitate allocation, routing may be needed; this will take the form of adding extra processes (where necessary), as outlined above. The output from the allocation process will be (if a valid allocation has been found):

(a) The process set for each transaction (i.e. the original set plus any new ones created for routing).
(b) The allocation (mapping) of each process to a node.
(c) The priority of each process.

† If the hardware architecture is not fixed but configurable or extendible, then this gives extra degrees of freedom to the allocation procedure.
On each node a schedulability test will be undertaken to ensure that all local deadlines will be met. The blocking time (i.e. the period of maximum deferred preemption) is an attribute of each process, and equation (1) will need to be modified so that the blocking factor is taken to be the maximum interval of deferred preemption of lower priority processes on that node. The choice of priority in effect determines the response times of the processes. Rather than have the simulated annealing algorithm choose intermediate deadlines directly, it has been found to be more effective to have the algorithm choose priorities and then let the scheduling formulae give the worst case response times. These can then be checked against acceptable deadlines.

The model is expanded to include any replicas, or extra tasks, implied by the use of replication or other SDIs. The input file indicates which transactions are to be replicated. The allocation process automatically creates an appropriate number of new tasks and tags them so that they can be kept separate from the original tasks.

4.4. The Application of Simulated Annealing

Process allocation can be viewed as a global combinatorial optimisation problem. It is similar, in nature, to other problems found in computer science, such as the travelling salesman problem. These problems have been successfully tackled by neighbourhood search techniques such as simulated annealing [10, 1, 9]. This technique attempts to find the lowest point in an energy landscape. The distinctive feature of the algorithm is that it incorporates random jumps to potential new solutions. This ability is controlled and reduced as the algorithm progresses.

In order to describe the algorithm some definitions are needed. The set of all possible allocations and process attributes (e.g. priority) for a given set of processes and nodes is called the problem space. A point in the problem space is a mapping of processes to nodes together with a set of particular process attributes. The neighbour space of a point is the set of all points that are reachable by permuting the characteristics of the point by some algorithm (e.g. moving a single process to another node). The energy of a point is a measure of the suitability of the allocation and attribute set represented by that point (poor allocations are high energy points). The energy function, with its parameters, determines the shape of the problem space; it can be visualised as a rugged landscape, with deep valleys representing good solutions and high peaks representing poor or infeasible ones. The allocation problem is that of finding the lowest energy point in the problem space.

The annealing algorithm works on single solution points, so a random starting point is chosen and its energy, $E_s$, evaluated. A random point in the neighbour space is chosen, and its energy, $E_n$, evaluated. A logistic acceptance criterion is applied: the new solution is accepted if either $E_n \le E_s$, or if

$$e^x \ge \mathrm{random}(0,1), \quad \text{where } x = \frac{E_s - E_n}{CV}$$

Here CV is the control variable, and 'random' is a uniform random number generator. During the annealing process CV is slowly reduced ('cooling' the system), making jumps to higher-energy solutions less likely. Eventually the system 'freezes' into a low energy state which will be close to the optimum value. The basic structure of the algorithm is as follows:
    choose random starting point P0
    choose starting temperature CV0
    repeat
        repeat
            EP := Energy at point Pn
            choose T, a neighbour of Pn
            ET := Energy at point T
            if ET < EP then
                Pn+1 := T
            else
                x := (EP - ET) / CVn
                if e^x >= random(0,1) then
                    Pn+1 := T
                else
                    Pn+1 := Pn
                fi
            fi
        until thermal equilibrium
        CVn+1 := f(CVn)
    until some stopping criterion

The initial temperature, CV0, is chosen automatically by the algorithm [10] so that virtually all proposed jumps are taken: a low temperature is picked and doubled until the acceptance ratio (the number of accepted jumps over the number of proposed jumps) is near to 100%. The temperature decrease function, f(CVn), is usually a simple multiplication by α, where 0 ≤ α < 1.
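As a concrete, purely illustrative rendering of this structure, the sketch below implements the algorithm in Python. The function names, the geometric cooling constant and the toy energy function are our own assumptions, not the paper's actual tool; the logistic acceptance test and the doubling heuristic for CV0 follow the description above:

```python
import math
import random

def initial_cv(point, neighbour, energy, cv=0.01, trials=50):
    """Pick CV0 automatically: double a low temperature until almost
    all proposed jumps would be accepted (acceptance ratio near 100%)."""
    while True:
        accepted = 0
        for _ in range(trials):
            e_p, e_t = energy(point), energy(neighbour(point))
            if e_t < e_p or math.exp((e_p - e_t) / cv) >= random.random():
                accepted += 1
        if accepted / trials > 0.95:
            return cv
        cv *= 2

def anneal(point, neighbour, energy, alpha=0.95, inner=200, outer=100):
    """Simulated annealing with logistic acceptance. `neighbour`
    proposes a point in the neighbour space; `energy` scores an
    allocation (lower is better). Cooling is geometric: f(CV) = alpha*CV."""
    cv = initial_cv(point, neighbour, energy)
    for _ in range(outer):            # until some stopping criterion
        for _ in range(inner):        # until "thermal equilibrium"
            cand = neighbour(point)
            e_p, e_t = energy(point), energy(cand)
            # accept improvements outright; accept worse points with
            # probability e^x, x = (EP - ET)/CVn (x <= 0 here)
            if e_t < e_p or math.exp((e_p - e_t) / cv) >= random.random():
                point = cand
        cv *= alpha                   # CV_{n+1} = f(CV_n)
    return point

# Toy usage: allocate six processes to three nodes so that the maximum
# node utilisation is minimised. This crude energy function is a
# stand-in; the real one would embed the schedulability test of
# section 3 and the routing and replica-separation constraints.
UTIL, NODES = [0.30, 0.25, 0.20, 0.20, 0.15, 0.10], 3

def neighbour(alloc):                 # move one process to another node
    a = list(alloc)
    a[random.randrange(len(a))] = random.randrange(NODES)
    return a

def energy(alloc):                    # maximum utilisation over the nodes
    return max(sum(u for u, n in zip(UTIL, alloc) if n == k)
               for k in range(NODES))

best = anneal([0] * len(UTIL), neighbour, energy)
print(best, energy(best))
```

In the allocation procedure described above, a poor allocation would be given a high energy by penalising missed local deadlines, infeasible routings and co-located replicas, so that the deep valleys of the landscape correspond to feasible configurations.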